March 13, 2015

Kernel Developers Summarize Linux Storage Filesystem and Memory Management Summit


A group of three Linux kernel developers kicked off the Linux Foundation Vault storage conference on Wednesday morning by hashing out proposed changes to the kernel and the stack from the Linux Storage Filesystem and Memory Management Summit (FS&MM), which took place earlier in the week.

James Bottomley, CTO at Parallels; Jeff Layton, senior software engineer at Primary Data; and Rik van Riel, principal software engineer at Red Hat, discussed a range of filesystem and memory management issues and solutions on stage, including OverlayFS and union mounts, nonblocking async buffered reads, powerfail testing, and more.

As Bottomley noted, the FS&MM conference began in 2006 as a storage summit. It combined with filesystems in 2007 and added memory management in 2009. The purpose from the start was to coordinate activity at the stack level. It has worked to date: “Working with this corner of the stack unblocks a huge number of issues,” he explained.

In that time, the conference has evolved into the premier forum for solving patch and architecture problems in this part of the Linux stack. In fact, “it achieves resolutions in cases that would have taken months of arguing on the mailing lists,” Bottomley added.

5 Big Topics Tackled Off the Bat

For this conference, the group tackled a wide range of issues. OverlayFS and union mounts were probably the top concern. On this topic, the conference addressed:

  • d_inode field vs. d_backing_inode

  • /proc/pid/fd points to wrong inode

  • file locking + fanotify

  • pagecache and metadata coherency between r/o and r/w opens

  • block layer pagecache sharing

  • ksm-like page merging based on dev/inode.

Probably the second biggest topic was nonblocking async buffered reads. This has been a long-standing problem because async buffered reads can always block. One common workaround has been the use of thread pools. Another allows a read to proceed only when the data is already in the pagecache. To address this further, the conference introduced new syscalls: preadv2 and pwritev2.
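The two workarounds described above can be combined in a short sketch. This is an illustration only, not the interface discussed at the summit. It assumes a Linux system where Python exposes os.preadv and the os.RWF_NOWAIT flag; the article does not say which flag name the proposal used, and the helper name read_cached_or_offload is hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_cached_or_offload(fd, length, offset, pool):
    """Hypothetical helper: try a cache-only read, else use a worker thread.

    Returns (data, path), where path is "cache" or "thread-pool".
    """
    nowait = getattr(os, "RWF_NOWAIT", None)  # absent on older Python/kernels
    if nowait is not None and hasattr(os, "preadv"):
        buf = bytearray(length)
        try:
            # With RWF_NOWAIT the kernel fails instead of waiting for disk I/O.
            n = os.preadv(fd, [buf], offset, nowait)
            if n == length:
                return bytes(buf), "cache"
        except OSError:
            # Data not cached (EAGAIN) or the flag is unsupported here.
            pass
    # Fallback: let a worker thread absorb the potentially blocking read.
    # A real event loop would not wait on the future; this sketch does,
    # to stay short.
    future = pool.submit(os.pread, fd, length, offset)
    return future.result(), "thread-pool"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp, ThreadPoolExecutor(2) as pool:
        tmp.write(b"hello pagecache")
        tmp.flush()
        data, path = read_cached_or_offload(tmp.fileno(), 5, 0, pool)
        print(data, "via", path)
```

Which path is taken depends on the kernel, the filesystem, and whether the data happens to be cached, so the sketch reports it rather than assuming it.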

Powerfail testing came up early too. The idea was to find a way to test for catastrophic power failures without simply pulling the plug. The suggested solution: dm-flakey, a device mapper target that integrates with xfstests.
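To make the idea concrete, here is a minimal sketch of how a dm-flakey mapping might be described. It only builds the table line that would be passed to dmsetup; it does not create the device, which requires root. The helper name flakey_table and the example device path are hypothetical, and the field order follows the kernel's dm-flakey documentation as best understood here (device, offset, up interval, down interval, optional features).

```python
def flakey_table(dev_path, num_sectors, up_secs, down_secs, drop_writes=True):
    """Hypothetical helper: build a device-mapper table line for 'flakey'.

    The mapped device behaves normally for up_secs seconds, then misbehaves
    for down_secs seconds, and repeats. With drop_writes, writes during the
    down interval are silently discarded, which resembles losing power
    before data reaches the disk.
    """
    if num_sectors <= 0 or up_secs < 0 or down_secs < 0:
        raise ValueError("sector count must be positive, intervals non-negative")
    fields = ["0", str(num_sectors), "flakey",
              dev_path, "0", str(up_secs), str(down_secs)]
    if drop_writes:
        fields += ["1", "drop_writes"]  # one feature argument follows
    return " ".join(fields)


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a real device.
    print(flakey_table("/dev/sdX", 2097152, 30, 10))
```

A test harness would hand this line to dmsetup create with its --table option, run a workload on the mapped device, and then check that the filesystem recovers to a consistent state.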

Another issue was generic VFS uid/gid remapping. This is a problem with user namespace containers: in some cases the UID and GID (user and group IDs) don’t properly reflect those in the container. The solution is to allow UID and GID squashing in v4.0.

Also announced were improvements to epoll and changes to the VFS layer. VFS, for example, will now unmount on invalidation in v4.0. Usually a Linux filesystem cannot be unmounted while it is actively in use, or at least thinks it is in use. Traditionally, a lazy unmount avoids this problem. Version 4.0 also improves lazy unmounts. On the filesystem side, the conference also looked at defrag improvements for block-based filesystems and NFS performance issues, specifically NFS latency under load.

More stack issues

The keynote team went on to race through other, predominantly stack-level issues, from memory management to page sizes to 32k blocks and multiqueue. For example, heterogeneous memory management addresses the case where memory is not all local. Also addressed was memory reservation for filesystem transactions.

Persistent memory is similar to RAM in speed, adding only the persistence attribute. Two questions arise: how to deal with the new devices and how to use this new memory. Two likely answers: use it as memory or use it as a block device. As a block device, it would be accessed through interfaces applications already know how to use. One caveat: persistent memory is fast, but you can’t take snapshots of it for backup.

Transparent huge pages offer interesting possibilities. Using 2MB pages instead of 4kB pages yields a 5-15% performance improvement. The open question is how to use the larger page size with persistent memory.
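A quick calculation shows the scale of the difference between the two page sizes. The sketch below counts how many pages are needed to map a region. The 1 GiB region is an arbitrary example, and linking the page count to the performance gain (fewer translations to cache and walk) is a commonly given explanation rather than something the keynote stated.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages required to cover region_bytes (rounded up)."""
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    region = 1 * GIB  # arbitrary example size
    small = pages_needed(region, 4 * KIB)
    huge = pages_needed(region, 2 * MIB)
    print(small, huge, small // huge)  # prints: 262144 512 512
```

Each 2MB page covers the same memory as 512 of the 4kB pages.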

Addressing IOPS

In terms of filesystem and block interfaces, the team explored new I/O and FS paradigms. The decision was to enhance the block and FS layers for upcoming hardware and infrastructure. They also looked at what it takes to support 32k block sizes. The 32k size is desired by disk vendors but not much liked by memory management and FS people. This is a logical vs. physical question. However, using the current 512e-style read-modify-write (RMW) layers, you can achieve most of the performance gains for little outlay. In the end, the team pondered how to fudge it.
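The logical vs. physical distinction can be illustrated with a toy model of read-modify-write. This is a simulation only, assuming 512-byte logical sectors on top of 32k physical blocks. The class EmulatedDisk and its fields are hypothetical and do not correspond to any kernel or drive interface.

```python
LOGICAL = 512        # logical sector size seen by the host
PHYS = 32 * 1024     # assumed physical block size (the 32k case)


class EmulatedDisk:
    """Toy model: small logical writes on top of large physical blocks."""

    def __init__(self, nblocks, phys=PHYS):
        self.phys = phys
        self.media = bytearray(nblocks * phys)
        self.rmw_count = 0  # physical blocks that needed read-modify-write

    def write(self, offset, data):
        if offset % LOGICAL or len(data) % LOGICAL:
            raise ValueError("writes must be whole logical sectors")
        end = offset + len(data)
        if end > len(self.media):
            raise ValueError("write past end of device")
        first, last = offset // self.phys, (end - 1) // self.phys
        for blk in range(first, last + 1):
            start, stop = blk * self.phys, (blk + 1) * self.phys
            lo, hi = max(offset, start), min(end, stop)
            chunk = data[lo - offset:hi - offset]
            if lo == start and hi == stop:
                self.media[start:stop] = chunk         # aligned: plain write
            else:
                block = bytearray(self.media[start:stop])  # read
                block[lo - start:hi - start] = chunk       # modify
                self.media[start:stop] = block             # write back
                self.rmw_count += 1


if __name__ == "__main__":
    disk = EmulatedDisk(2)
    disk.write(512, b"\xaa" * 512)      # partial block: needs RMW
    disk.write(PHYS, b"\xbb" * PHYS)    # full aligned block: no RMW
    print(disk.rmw_count)               # prints: 1
```

In this model, aligned full-block writes pay nothing extra, while each partially covered physical block costs an extra read. That cost is the tradeoff behind keeping a small logical size on top of a larger physical one.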

As far as SMR (shingled magnetic recording) goes, the team is seeking one true representation. In general, however, Linux block-based I/O has not been tuned as well as the network stack to support millions of IOPS or, as some analysts project, trillions of IOPS. FS extensions are under consideration but would alter the existing FS paradigm considerably. The dominant opinion seems to be not to change the status quo.

Concerning multiqueue, the team identified three problems:

  1. Polled I/O

  2. Driver APIs

  3. I/O schedulers.

The decision: no I/O scheduler for now, with a plan to integrate some form of I/O scheduler with multiqueue going forward.

In terms of iSCSI performance, the team faced a choice between MC/S and Session Groups. Neither was particularly appealing. MC/S requires standards updates. Session Groups code exists but was described as ugly. Session Groups appears to have won this one for now.

The upshot is that the Linux core team is attuned to the demands of storage and sensitive to the tradeoffs that changes to the stack would require. Of course, doing nothing entails its own costs and risks, so it isn’t an option for long either.

Based on this Vault keynote presentation, the Linux storage community’s varied interests are being thoroughly considered and generally addressed.
