January 5, 2005

Understanding NetBSD 2.0's new technology

Author: Federico Biancuzzi

NetBSD is widely known as the most portable operating system in the world. It currently supports 52 system architectures, all from a single source tree, and is always being ported to more. NetBSD 2.0 continues the long tradition with major improvements in file system and memory management performance, significant security enhancements, and support for many new platforms and peripherals. To celebrate the release, we've asked several well-known NetBSD developers to comment on some of NetBSD 2.0's new features.

A complete list of the changes and new features in NetBSD 2.0 can be found in the "changes" guide, but here are the highlights:

  • Native thread support has been added, based on scheduler activations. Applications which support native threads can now take full advantage of the high-performance NetBSD POSIX threads implementation.
  • Kernel events notification framework: kqueue provides a stateful and efficient event notification framework. Currently supported events include socket, file, directory, fifo, pipe, tty and device changes, and monitoring of processes and signals. kqueue is supported by all writable file systems in the NetBSD tree (with the exception of Coda) and all device drivers supporting poll().
  • Improvements have been made to NetBSD's Linux emulation to support the latest Sun JDK/JRE for Linux. Testing has shown that it now runs as well as it does on Linux natively.
  • NetBSD 2.0 enforces non-executable mappings on many platforms. This means that the process stack and heap mappings are non-executable by default, making exploitation of potential buffer overflows harder. NetBSD 2.0 supports PROT_EXEC permission via mmap() for all platforms where the hardware differentiates execute access from data access, though not necessarily with single-page granularity. When the hardware has a larger granularity, the rule is that if any page in the larger unit is executable, then the entire larger unit is executable, otherwise the entire larger unit is not executable.
  • The i386 port now supports SMP and has a new ACPI and power management framework which takes advantage of Intel's ACPI implementation.
  • FreeBSD's UFS2 has been ported to NetBSD. UFS2 is an extension to FFS, adding 64-bit block pointers and support for extended file storage. Among other enhancements, UFS2 allows for file systems larger than 1 terabyte.
  • The systrace framework has been added to the system. systrace monitors and controls an application's access to the system by enforcing access policies for system calls. The systrace utility might be used to trace an untrusted application's access to the system. In addition, it can be used to protect the system from software bugs (such as buffer overflows) by constraining a daemon's access to the system. The privilege elevation feature of systrace can be used to obviate the need to run large, untrusted programs as root when only one or two system calls require the elevated privilege.
  • Verified Exec support has been added in this release. Verified Exec verifies a cryptographic hash before allowing execution of binaries and scripts. This can be used to prevent a system from running binaries or scripts which have been illegally modified or installed. In addition, Verified Exec can also be used to limit the use of script interpreters to authorized scripts only and disallow interactive use.
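
The granularity rule in the non-executable mappings item can be illustrated with a small model. This is only a sketch: the function name, the list-of-booleans representation, and the unit size of four pages in the usage note are assumptions made for illustration, not NetBSD kernel code.

```python
def effective_exec(requested, pages_per_unit):
    """Return the execute permission each page actually ends up with.

    requested      -- list of booleans, one per page (True = PROT_EXEC requested)
    pages_per_unit -- number of pages the hardware protects as a single unit
    """
    effective = []
    for start in range(0, len(requested), pages_per_unit):
        unit = requested[start:start + pages_per_unit]
        # If any page in the larger unit is executable, the whole unit is;
        # otherwise the whole unit is non-executable.
        unit_exec = any(unit)
        effective.extend([unit_exec] * len(unit))
    return effective
```

With a hypothetical unit of four pages, requesting execute permission on the second page of eight makes the first four pages executable and leaves the last four non-executable. With single-page granularity (a unit of one page), the result equals the request.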

More information about the 2.0 release can be found in the release announcement, which is available in a variety of languages.

Interview with NetBSD developers

We interviewed NetBSD programmers Christos Zoulas, Luke Mewburn, Ben Collver, Nathan J. Williams, Jaromir Dolecek, Chuck Silvers, Hubert Feyrer, Brett Lymn, Jan Schaumann, Roland Dowdeswell, and Niels Provos. Here's what they had to say about some of the new features in NetBSD:

One of the reasons why this release has the new major number 2.0 is the introduction of SMP support. Which technologies have you chosen and developed?

Luke Mewburn: SMP (so-called "biglock" model) and kernel assisted userland threads, with the pthreads (POSIX threads) API.

How do they compare with FreeBSD 4.x and 5.x, DragonFlyBSD, and OpenBSD
SMP technologies?

Luke Mewburn: If I recall correctly, FreeBSD 4.x is biglock.
FreeBSD 5.x and DragonFlyBSD have taken different approaches at
solving the problems with biglock SMP. OpenBSD is biglock, with
various bits derived from an older NetBSD SMP codebase.

Are there any optimizations specific for some platforms, like i386 Intel
HyperThreading technology?

Luke Mewburn: We enable support for HT, and each virtual CPU is treated separately.
Our scheduler currently doesn't have any specific knowledge of HT and
therefore doesn't take advantage of HT-specific architectural issues
when scheduling processes on (virtual) CPUs.

Is the performance of single-CPU systems affected by the new code?

Luke Mewburn: Shouldn't be, unless you run an SMP-enabled kernel.

Native thread support has been added. Why the word "native?"

Christos Zoulas: They are "native" to the operating system, i.e. there are features
of the operating system the threads library uses that were
programmed specially to assist threads. They are native because they
come with the base operating system, instead of requiring a 3rd party
package to be installed. Finally, the word native means that they
are really kernel threads, not just userland-based. In the NetBSD
implementation we have both userland and kernel threads (what is
called an n:m thread model).

Ben Collver: Releases prior to NetBSD 2.0 did not provide a threads API in the base
system. Technically, there were exceptions. For example, the clone()
system call existed under Linux emulation to run threaded Linux
applications. Userland threading implementations are sufficient to run
most threaded applications in older NetBSD releases. In pkgsrc we use
GNU PTH, and this is probably the most used implementation.

How did the old threads work?

Nathan J. Williams: Previously, there were no integrated threads. Applications that wanted
to use threads would link in a library like GNU pth or PTL2. These
libraries are known in the literature as pure user-space systems,
meaning that all of the thread implementation is done without the
kernel's knowledge, or 1:N threads, meaning that all of the
application's threads are multiplexed onto one kernel
process. Libraries of this form have problems implementing the full
promise of threads; in particular, they are vulnerable to some
computation or system call blocking all of the threads in the process,
since the kernel is unaware of the existence of threads.

Ben Collver: The paper "Portable Multithreading" by Ralf S. Engelschall describes how
GNU PTH works.

One of the drawbacks of GNU PTH is that threaded applications do not
benefit from multiprocessor machines. Another drawback is that the
scheduling is not preemptive.

What does POSIX standard compliance bring?

Christos Zoulas: Application portability between different platforms, i.e. a threaded
application that works with a POSIX-compliant pthread library on any other OS will work on NetBSD.

Nathan J. Williams: POSIX threads are really the only thread game in town these days.

They are based on "scheduler activations." What are they?

Nathan J. Williams: Scheduler activations are a mechanism, invented by Thomas Anderson in a
1992 paper, that
provides an interface between an operating system kernel and an
application for maintaining a desired level of concurrency. In this
system, the application informs the kernel how much concurrency it
has, e.g. how many simultaneously computing threads it will use, and
the kernel maintains a certain number of "activations," or scheduleable
entities, on which the library layers application computation. It
includes messages for adjusting the size of an application's
computation allocation and for notifying of operations that have
blocked in the kernel and resumed.

The principal virtue of SA for a POSIX thread implementation is that
the number of scheduleable kernel entities, and kernel resources in
general, is limited to the number of concurrently-operating threads,
not the total number of threads in the application. This saves
resources and prevents a lot of unnecessary intra-thread scheduling
competition.

You might also want to read my USENIX paper "An Implementation of Scheduler Activations on the NetBSD Operating System".

What is the new kqueue framework?

Jaromir Dolecek: Traditional select()/poll() is limited to descriptors.
There are other interesting kernel events, such as signals, or file system changes.

Furthermore, select()/poll() doesn't scale very well for a large
number of watched descriptors. Their efficiency decreases quickly
with larger sets -- the whole watched descriptor set must be passed
for each syscall invocation, forcing the system to perform two
memory copies across the user/kernel boundary, reducing the memory
bandwidth available for other activities. The internal kernel
handling is not very scalable either, forcing the kernel to do
potentially several passes through the list and having poor
interaction if a descriptor is watched by more than one process or
thread.
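
As an illustration of that per-call cost, here is a minimal sketch in Python (chosen for brevity; the function names and the pipes used as event sources are assumptions for the example). Note that the complete list of watched descriptors has to be handed to select() on every call, because the kernel keeps no memory of it between calls.

```python
import os
import select


def poll_once(read_fds, timeout=0.0):
    """Ask the kernel which of read_fds are readable right now.

    The full descriptor list is passed in on every call.
    """
    readable, _, _ = select.select(read_fds, [], [], timeout)
    return readable


def demo():
    # Three pipes stand in for a (hypothetical) set of watched descriptors.
    pipes = [os.pipe() for _ in range(3)]
    read_fds = [r for r, _ in pipes]
    try:
        os.write(pipes[1][1], b"x")          # make only the second pipe readable
        first = poll_once(read_fds)          # whole set passed in
        second = poll_once(read_fds)         # and passed in again
        return [read_fds.index(fd) for fd in first], first == second
    finally:
        for r, w in pipes:
            os.close(r)
            os.close(w)
```

Calling demo() returns ([1], True): only the second pipe is reported, and the second call has to resubmit the same three descriptors to learn the same thing.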

kqueue provides a generic method of notifying the user when an event
happens or condition holds, and also provides further information
related to the event. The kqueue framework is stateful: an application
registers the events it is interested in with the kernel via the kqueue descriptor,
and can then wait for any of those events to occur. For NetBSD 2.0,
supported events include the traditional functionality provided by
select()/poll(); notifications for file system events such as
unlink, rename, attribute changes, similar to the Windows directory
notifications or IRIX's /dev/imon; signals; and an arbitrary number of
timers. All types of descriptors are supported, including FIFOs,
sockets, open files or other kqueue descriptors. There are no limits
on the number of registered events for any type of object.

kqueue is usually used synchronously. The kqueue descriptor is pollable
too, so it's compatible with the traditional non-blocking application model. Asynchronous notification of new events is available as well, by using standard O_ASYNC signal semantics on the kqueue descriptor.
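
A minimal sketch of the register-once, wait-later pattern, using Python's select.kqueue wrapper around the kqueue()/kevent() calls. The wrapper exists only on systems that provide kqueue, so the sketch returns None elsewhere. The function name and the pipe used as an event source are assumptions for illustration.

```python
import os
import select


def kqueue_pipe_demo(timeout=1.0):
    """Register a pipe's read end with kqueue once, then wait for an event."""
    if not hasattr(select, "kqueue"):
        return None                      # platform without kqueue (e.g. Linux)
    r, w = os.pipe()
    kq = select.kqueue()
    try:
        # Register interest a single time; the kernel remembers it.
        change = select.kevent(r, filter=select.KQ_FILTER_READ,
                               flags=select.KQ_EV_ADD)
        kq.control([change], 0, 0)       # apply the change, fetch no events
        os.write(w, b"hello")
        # Later calls pass no change list; they only collect pending events.
        events = kq.control(None, 1, timeout)
        return [(ev.ident == r, ev.data) for ev in events]
    finally:
        kq.close()
        os.close(r)
        os.close(w)
```

For a read filter, the data field of each returned event carries extra information about the event (for a pipe or socket, the amount of data waiting), which is the "further information related to the event" mentioned above.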

How does kqueue work?

Jaromir Dolecek: Internally to the kernel, kqueue has a concept of kernel filters. When
an application registers for an event, the appropriate filter is called to check if the event has already happened. If this is the case, the event data is immediately available for pickup. Once registered, the watch set is never re-evaluated -- the kernel just automatically removes items from the kqueue list when watched descriptors are closed.

When a watched event happens, the condition and associated state
is recorded in the appropriate kqueue structures, and the application
can pick up the event. If the event happens several times before the
application reads the event data, the data reflects only the last event context.

kqueue scales as O(1) in the number of watched events, in contrast
to O(n) for select()/poll(). There is no performance degradation
when the same descriptor is watched by more than one process or thread, either.

It's very easy to convert applications to use kqueue. System
utilities such as tail(1), syslogd(8) or inetd(8) were modified to take advantage of this functionality for better scalability and
simpler event handling, and kqueue is also being used
in third-party code. The API is available in all the main open-source
BSDs, including NetBSD, FreeBSD, and OpenBSD.

Christos Zoulas: This came from FreeBSD's Jonathan Lemon.

Regarding memory protection: non-exec stack and/or heap, propolice, W^X. Could you make a summary of the status of these technologies in the 2.0 release?

Chuck Silvers: In NetBSD 2.0, both the process stack and heap are non-executable by default
on platforms where the hardware supports it. Propolice is not integrated
into NetBSD, nor are the other OpenBSD W^X changes. Yet.

Which hardware includes support for these protections?

Chuck Silvers: The hardware platforms that fully support non-executable mappings are
AMD64, SPARC64, SPARC (sun4m, sun4d), PowerPC (IBM4xx), Alpha, SH5 and HPPA.

The hardware platforms that partially support non-executable mappings are
i386 and PowerPC OEA (e.g. macppc). See this document for more details.

I like the idea of Verified Exec, but I'm wondering how much time is needed to set up the environment.

Brett Lymn: Not long. You need to build a kernel with Verified Exec support in it. The
most time-consuming part is generating the fingerprint file. To help with
this, there are a couple of scripts in /usr/share/examples/veriexecctl/
which will scan a system and generate the fingerprints for all
files that are appropriate. Depending on the sort of machine you have,
this may take a while to complete. Once done the resulting signatures
file can be placed into /etc, ready to be loaded.

How does the verifying process work?

Brett Lymn: Early in the boot process, a list of fingerprints (either md5 or sha1 hashes) for
all the executables and shared libraries is loaded into kernel memory.
When something is executed or a shared library referenced, the hash of
that executable/file is calculated and compared against the list. If they
match, then the execution/access is allowed; otherwise it is blocked. The
idea is to prevent trojans from being inserted into system utilities, and to stop
unknown binaries from executing. More details here.
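
The check described above can be modelled in a few lines of userland Python. This is only a sketch of the logic, not the kernel implementation: the dictionary used as the fingerprint table, the function names, and the temporary file are assumptions for illustration, and the table is not in the format of the real signatures file.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return the SHA-1 hex digest of the file at path."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def may_execute(path, table):
    """Allow execution only if path is listed and its hash still matches."""
    expected = table.get(path)
    return expected is not None and fingerprint(path) == expected


def demo():
    with tempfile.TemporaryDirectory() as d:
        tool = os.path.join(d, "tool")
        with open(tool, "wb") as f:
            f.write(b"#!/bin/sh\necho ok\n")
        table = {tool: fingerprint(tool)}        # list loaded "at boot"
        before = may_execute(tool, table)        # untouched file
        with open(tool, "ab") as f:
            f.write(b"echo trojan\n")            # file modified afterwards
        after = may_execute(tool, table)
        unknown = may_execute(os.path.join(d, "other"), table)
        return before, after, unknown
```

Here demo() returns (True, False, False): the untouched file is allowed, the modified file is blocked because its hash no longer matches, and a file that was never fingerprinted is blocked as unknown.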

systrace has finally been included in the base system. Is there any
application systraced by default?

Niels Provos: There is currently no application systraced by default. systrace can
be used by the paranoid system administrator to further tighten their security. For example, monkey.org is a small ISP that runs systrace to provide restricted shell access to all their users. It seems to be working fine for them and systrace has already prevented compromises for them.

NetBSD ported UFSv2 from FreeBSD-5. Have you introduced any improvements or
new features during the porting effort?

Christos Zoulas: No, we have not. We have introduced a block level snapshot mechanism
(/dev/{r,}fss*), but that is still experimental.

Have you had any portability problems with any of your multiple supported
architectures?

Christos Zoulas: With UFS2? We had compatibility issues with our boot blocks and UFS1
file systems, but we've ironed those out. If you mean general code portability there are 3 classes of problems:

  • Endianness assumptions
  • Alignment constraint violation
  • Type size assumptions

Most of the problems we encounter are in pkgsrc, not in our own sources.
But if you can get your code to work on i386 (little endian, 32 bits, no alignment constraints) and SPARC64 (big endian, 64 bits, and alignment constraints), you've probably got most of the portability bugs figured out.
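
The three classes of problems can be seen from Python's struct module, which exposes the platform's C layout rules. This is only an illustration; the helper names are made up for the example.

```python
import struct


def encode_u32(value, byte_order):
    """Pack value into 4 bytes; byte_order is '<' (little) or '>' (big endian)."""
    return struct.pack(byte_order + "I", value)


def layout_sizes():
    """Compare native C layout with an explicit, platform-independent one."""
    return {
        # char followed by int: native layout may insert alignment padding
        "native_char_int": struct.calcsize("@bi"),
        "packed_char_int": struct.calcsize("<bi"),   # no padding: 1 + 4 bytes
        # C long: size depends on the platform in native mode
        "native_long": struct.calcsize("@l"),
        "standard_long": struct.calcsize("<l"),      # always 4 bytes
    }
```

encode_u32(0x11223344, "<") yields the bytes 44 33 22 11, while ">" yields 11 22 33 44, so code that writes a native integer to disk or the network bakes in an endianness assumption. The native sizes returned by layout_sizes() vary between platforms, which is exactly why code that hard-codes them breaks when ported.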

Finally, there is operating system feature portability, but that has become easier because most OSes are POSIX compliant these days.

What is the status of NetBSD/xen?

Luke Mewburn: It works. We're using it within the project to provide multiple virtual machines on one physical machine for internal administrative purposes.

People that run NetBSD on a notebook should try this release because of ACPI. What concrete advantages does it bring?

Luke Mewburn: Support for newer laptops, including thermal zone support.
However, there is no ACPI suspend/resume support at this time.

Hubert Feyrer: *cough* I wouldn't say that ACPI is the only reason for trying NetBSD on a
notebook. The fine frameworks for PCMCIA and Cardbus make NetBSD well worth
a try, in addition to the support readily available for a
large number of USB devices and also modern PCI devices found in today's
laptops and notebooks. The USB and PCMCIA/Cardbus systems of NetBSD are
used as a base for other systems' implementations.

There are multiple new standards for wireless networking and I'd like to
know which are supported: WPA, WiFi-Max and all the 802.11a/b/g/i.

Hubert Feyrer: This depends on the cards that provide these features, and the drivers for
them. For Atheros cards, no WPA is available; WiFi-Max is a non-technical
term that may be equivalent to the Atheros "Turbo" (108 Mbps) mode, which is
fully working under NetBSD. 802.11 standards supported by NetBSD 2.0 are
"b", "a" and "g". (I have no idea whether "i" is already out.)

Luke Mewburn: 802.11a/b/g is supported for various cards, including Atheros-based devices.
Support for WPA is in progress. No idea about WiFi-Max.

An interesting feature for laptop users is the cryptographic disk driver. How does it work?

Roland Dowdeswell: The best description can be found in the paper and man pages (which
I have on my web page). But, in
short form, CGD attaches to a disk or a partition on a disk and
presents a pseudo disk to the rest of the operating system. It
encrypts and decrypts the information in transit. So, if
you attach a CGD to, say, /dev/wd0e, then when you read and write
to /dev/cgd0[a-h] what you get is the decrypted version of what's
on /dev/wd0e. You can do anything with /dev/cgd0[a-h] that you
can do with normal disks (mainly file systems and swap, but things like database backends are possible).
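
The layering can be sketched as a toy model. Everything here is an assumption for illustration: the class names, the 512-byte sector size, and above all the "cipher", which is a repeating-key XOR used only as a stand-in and offers no real protection. CGD itself uses proper block ciphers.

```python
class RamDisk:
    """Backing store made of fixed-size sectors (stand-in for /dev/wd0e)."""

    def __init__(self, sectors, sector_size=512):
        self.sector_size = sector_size
        self.data = [bytes(sector_size) for _ in range(sectors)]

    def read(self, n):
        return self.data[n]

    def write(self, n, buf):
        assert len(buf) == self.sector_size
        self.data[n] = bytes(buf)


class ToyCryptDisk:
    """Pseudo disk layered on another disk (stand-in for /dev/cgd0).

    NOT real cryptography: XOR with a repeating key replaces the cipher.
    """

    def __init__(self, lower, key):
        self.lower = lower
        self.key = key

    def _transform(self, n, buf):
        # XOR is its own inverse, so one routine both encrypts and decrypts.
        # The sector number shifts the key so equal sectors differ on disk.
        return bytes(b ^ self.key[(i + n) % len(self.key)]
                     for i, b in enumerate(buf))

    def read(self, n):
        return self._transform(n, self.lower.read(n))

    def write(self, n, buf):
        self.lower.write(n, self._transform(n, buf))
```

Anything written through the upper device reads back unchanged through it, while the lower device only ever holds the transformed bytes. A file system placed on the upper device needs no knowledge of the layer beneath.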

How sensitive is it to unexpected system shutdown?

Christos Zoulas: It will work just fine. It is a block-level driver, so it works just like
another layer on top of the regular disk.

Roland Dowdeswell: Use of CGD does not affect the OS's behaviour on unexpected system
shutdown because it preserves all of the atomicity expectations of the underlying disk device.

Did you port it from OpenBSD?

Roland Dowdeswell: No. When I wrote CGD, I looked for prior art but I found no open
source disk encryption that was both usable and cryptographically
reasonable. OpenBSD allows you to add encryption to vnd(4), which
works in a similar way, but it only supports one cipher (blowfish)
and the key generation is weak, so one shouldn't rely on it to
actually protect your data.

Given that, I wrote CGD from scratch.

CGD has been partially ported to OpenBSD by Ted Unangst, but last I checked it had not been integrated into the standard release.

I remember that NetBSD had a LiveCD based on the 1.6 release. Is there any plan for an updated version based on the 2.0 code?

Jan Schaumann: I don't know about plans, but there's sysutils/mklivecd in pkgsrc, which allows any user to create their own LiveCDs. Maybe worth mentioning.
(I'll actually be using 2.0-based LiveCDs for a local programming contest organized by the ACM.)

Hubert Feyrer: Yes. A German-language, 2.0-based LiveCD has already been published by the
German FreeX magazine.
The CD boots into KDE and offers KOffice, Sodipodi and a number of other applications.
FreeX editor Joerg Braun has kindly provided an international version of the CD,
which will be made available at about the time 2.0 is released.
