ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Puffy's Marathon: What's New in OpenBSD 4.2

by Federico Biancuzzi
11/01/2007

OpenBSD is famous for its focus on security. Today, November 1st, the team is proud to announce Release 4.2.

Even though security is still there, this release comes with some amazing performance improvements: basic benchmarks showed PF being twice as fast, a rewrite of the TLB shootdown code for i386 and amd64 cut the time to do a full package build by 20 percent (mostly because all the forks in configure scripts have become much cheaper), and the improved frequency scaling on MP systems can help save nearly 20 percent of battery power.

And then the new features: FFS2, support for the Advanced Host Controller Interface, IP balancing in CARP, layer 7 manipulation with hoststated, Xenocara, and more!

Federico Biancuzzi interviewed 23 developers and assembled this huge interview...

There has been a lot of work to improve performance in PF and networking! What results have you achieved and how?

Henning Brauer: Network data travels in so-called mbufs through the system, preallocated, fixed size buffers, 256 bytes on OpenBSD. They are chained together, and they can, instead of carrying the data itself, point to mbuf clusters of 2 KB size each.

PF needs to keep track of various things it does to packets like the queue ID for ALTQ on the outbound interface, the tags for the tag/tagged keywords, the routing table ID, route-to loop prevention, and quite a bit more. Previously we used mbuf tags for that. mbuf tags are arbitrary data attached to a packet header mbuf. They use malloc'd memory. And that turned out to be a bottleneck. So I finally did what I wanted to do for some time (and that Theo, Ryan, and I discussed before)—put this extra information directly into the packet header mbuf, not mbuf tags, and thus get rid of the need to malloc memory for each packet handled by PF.

Since PF has its tentacles everywhere in the network stack, changing this was a big undertaking, but it turned out to make things way easier in many cases and even fix some failure modes (we cannot run out of memory for the mbuf tags any more).

In our tests with a Soekris 4801 as a bridge with the simplest possible ruleset (just one rule: "pass all"), this change doubled performance: forwarding went from 29 to 58 Mbit/s.
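
The difference between the two approaches can be sketched in a few lines of C. This is a simplified illustration, not the OpenBSD source: every name in it (pf_meta, pkt_old, pkt_new, old_set_tag, new_set_tag) is hypothetical. It only shows why metadata attached through a separately allocated tag costs an allocation per packet and can fail, while a field embedded in the packet header does neither.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-packet PF metadata (queue ID, tag, routing table ID). */
struct pf_meta {
    unsigned int qid;
    unsigned int tag;
    unsigned int rtableid;
};

/* Old style: metadata hangs off the header as a separately allocated tag. */
struct pkt_old {
    struct pf_meta *meta;       /* NULL until a tag is attached */
};

/* New style: metadata is part of the packet header itself. */
struct pkt_new {
    struct pf_meta meta;
};

static unsigned long tag_allocs;    /* counts allocations, for illustration */

/* Attach a tag the old way; this can fail when memory runs out. */
static int
old_set_tag(struct pkt_old *p, unsigned int tag)
{
    assert(p != NULL);
    if (p->meta == NULL) {
        p->meta = calloc(1, sizeof(*p->meta));
        if (p->meta == NULL)
            return -1;
        tag_allocs++;
    }
    p->meta->tag = tag;
    return 0;
}

/* Set the tag the new way; no allocation, so there is no failure path. */
static void
new_set_tag(struct pkt_new *p, unsigned int tag)
{
    assert(p != NULL);
    p->meta.tag = tag;
}
```

Tagging a packet through old_set_tag() bumps tag_allocs once per packet, while new_set_tag() leaves it untouched.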

What about other PF optimizations?

Henning Brauer: Packet forwarding can skip IPsec stack if no IPsec flows are defined. This is simply a shortcut: if there are no IPsec flows in the system we do not need to descend into IPsec land. This yields a further 5 percent improvement in packet forwarding performance.

Also, quite some time ago, someone discovered that firewalls replied with RST or ICMP to packets with an invalid protocol checksum. Since an end host wouldn't have replied due to the checksum error, you could spot the firewall. Because of that, we were verifying the protocol checksum for each and every packet in PF. I changed it to only do so if we are actually about to send an RST back. Voila, 10 percent higher forwarding rate.

How does your work on improving PF performance affect the two settings for binding states (if-bound/floating)? What is the default in 4.2? Which setting is faster?

Henning Brauer: I have completely rewritten the code dealing with interface bound states. Previously, there was one (well, a pair really) state table per interface, plus a global one. So for each packet, we first had to do a lookup in the state table on the interface the packet was coming in or going out on, and then, if we didn't have a match, in the global table. Ryan split a state entry in the "key" (everything needed to find a state) and the state info itself.

In the second step I changed things to allow more than one state to attach to a state key entry; states are now a linked list on the state key. When inserting, we always insert if-bound states first and floating ones last. At lookup time, we now only have to search the global state table (there is no other any more), and then start walking the list of states associated with the key found, if any. We take the first one and check whether the interface the state is bound to matches the one we're currently dealing with or is unset (aka floating state). If that is true, we're done. If not, we get the next state and repeat until we find one; if we don't, there is no match. This works because we make sure that if-bound states are always before floating ones. In the normal case with floating states there is only one entry and we're done. This change increased forwarding performance by more than 10 percent in our tests.
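
The lookup described above can be sketched as follows. This is a toy model under stated assumptions, not PF's actual data structures: the names state, state_key, state_attach and state_lookup are hypothetical, the key itself is reduced to a list head, and an interface index of 0 stands for a floating state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified state: ifindex 0 means "floating". */
struct state {
    int ifindex;                /* interface the state is bound to, 0 = any */
    struct state *next;
};

/* Hypothetical state key: in this sketch only the list head matters. */
struct state_key {
    struct state *states;       /* if-bound states first, floating last */
};

/* Insert while keeping the invariant: if-bound in front, floating at the end. */
static void
state_attach(struct state_key *sk, struct state *s)
{
    struct state **pp;

    assert(sk != NULL && s != NULL);
    if (s->ifindex != 0) {      /* if-bound: push to the front */
        s->next = sk->states;
        sk->states = s;
        return;
    }
    for (pp = &sk->states; *pp != NULL; pp = &(*pp)->next)
        continue;               /* floating: find the end of the list */
    s->next = NULL;
    *pp = s;
}

/* Walk the list; the first state that is bound to this interface,
 * or that is floating, is the match. */
static struct state *
state_lookup(const struct state_key *sk, int ifindex)
{
    struct state *s;

    assert(sk != NULL);
    for (s = sk->states; s != NULL; s = s->next)
        if (s->ifindex == ifindex || s->ifindex == 0)
            return s;
    return NULL;
}
```

Because floating states always sit at the end, a packet on an interface with its own bound state finds that state before any floating one, and the common case of a single floating state is a one-step walk.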

As you might guess, this was not a simple change. Ryan and I have been thinking about it, discussing and developing the concept for more than a year—well, this is only part of it really.

Defaults have not changed; floating states are default, and there are very, very, very few reasons to change that, ever. There is no performance difference between the two.

"Improvement in the memory pool handling code, removing time from the pool header leads to better packet rates." How did you spot this, and who fixed it?

David Gwynne: That was something I found at the hackathon this year while profiling the kernel. Ted Unangst fixed it for me.

The memory for packets in the kernel is allocated out of pools. Every time a chunk of memory was returned to a pool the hardware clock was read so it would know which pool was last used. Reading the hardware clock is very slow on some machines, which in turn causes things that use pools a lot to slow down. Since I was trying to move a lot of packets through a box, I was noticing it. After describing the problem to Ted, he decided that the timestamping was unnecessary and removed it.

"Make the internal random pool seed only from network interrupts, not once per packet due to drivers using interrupt mitigation more now." Another step to improve networking speed! Who worked on this?

David Gwynne: This was something I found while profiling the kernel at the Calgary hackathon this year. This time it was fixed by Ryan McBride.

The kernel is responsible for providing random numbers, but to do that effectively it has to collect randomness from suitable sources. One of those sources is the times that interrupts occur. The time that network interrupts are measured at is when they signal that a packet has been received. The time is read from the hardware clock which, as we've said before, can be really slow. The problem with this is that modern network cards are capable of receiving many packets per interrupt, which in turn means we read the hardware clock several times per interrupt instead of once like we intended.

Reducing the number of these clock reads means we can spend more time processing the packets themselves, which in turn increases the throughput. Ryan managed to do this by modifying the network stack to defer the stirring of the kernel random pool to the softnet interrupt handler.

When a network card's interrupt handler sends a packet to the stack, it is quickly analyzed to figure out if it is a packet we're interested in. If it is interesting, then we put it on a queue to be processed and a soft interrupt is raised. When the hardware interrupt is finished the soft network interrupt is called and all the packets in that queue are processed. So for every hardware interrupt that occurs, we end up doing one softnet interrupt too. By sticking the stirring of the random pool at the top of the softnet handler, Ryan got us back to reading the clock once per interrupt instead of once per packet. Throughput went up again.
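
A small counting model makes the saving concrete. It is only an illustration: read_hw_clock(), rx_per_packet() and softnet_per_interrupt() are hypothetical stand-ins, not kernel functions, and the "clock" is just a counter that records how often it was read.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long clock_reads;       /* how often the slow clock was read */
static unsigned long packets_processed;

/* Stand-in for an expensive hardware clock read. */
static unsigned long
read_hw_clock(void)
{
    return ++clock_reads;
}

/* Old behaviour: every received packet triggers its own clock read. */
static void
rx_per_packet(size_t npackets)
{
    size_t i;

    for (i = 0; i < npackets; i++) {
        (void)read_hw_clock();  /* timestamp fed to the random pool */
        packets_processed++;
    }
}

/* New behaviour: one clock read at the top of the soft interrupt,
 * then the whole queue of packets is processed. */
static void
softnet_per_interrupt(size_t npackets)
{
    size_t i;

    assert(npackets > 0);       /* softnet only runs when packets are queued */
    (void)read_hw_clock();      /* timestamp fed to the random pool */
    for (i = 0; i < npackets; i++)
        packets_processed++;
}
```

With a card that delivers 32 packets per interrupt, the first function reads the clock 32 times and the second reads it once.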

"Enable interrupt holdoff on sis(4) chips that support it. Significant performance gain for slower CPU devices with sis(4), such as Soekris." Would you like to tell us more about this?

Chris Kuethe: Quite a number of network adapters have a configurable mechanism to prevent the machine from being run into the ground under network load. This is known as holdoff, mitigation, or coalescing. The general idea is that the network adapter does not raise an interrupt as soon as a frame has arrived; rather, the interrupt is delayed a short time (usually one frame or a few hundred microseconds) in case another frame arrives very soon thereafter.

Picking a good delay value, or set of conditions under which to signal the arrival of a frame is not easy. Too much holdoff and network performance is severely degraded, too little and no benefit will be noticed. When ping times go up and TCP stream speeds go down, you're delaying too much.

In the case of the Soekris (or anything else that uses sis(4)), interrupt holdoff was not enabled. By enabling holdoff, we allow the network controller to delay and buffer a few frames. This spreads the cost of the interrupt across several packets.
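
To get a feel for the trade-off, here is a minimal simulation of a timer-based holdoff. It is a sketch under simplifying assumptions, not how sis(4) or any particular chip implements it: count_interrupts() is a hypothetical helper, arrival times are given in microseconds in ascending order, and the first pending frame starts a timer that every later frame inside the window shares.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count interrupts for frames arriving at the given times (microseconds,
 * ascending). With holdoff_us == 0 every frame interrupts immediately;
 * otherwise the first pending frame starts a timer and every frame that
 * arrives before the timer expires shares the same interrupt.
 */
static size_t
count_interrupts(const unsigned long *arrival_us, size_t n,
    unsigned long holdoff_us)
{
    size_t i = 0, interrupts = 0;

    assert(arrival_us != NULL || n == 0);
    while (i < n) {
        unsigned long fire_at = arrival_us[i] + holdoff_us;

        interrupts++;
        i++;
        while (i < n && arrival_us[i] < fire_at)
            i++;                /* coalesced into the pending interrupt */
    }
    return interrupts;
}
```

For frames arriving at 0, 10, 20, 500, 510 and 2000 microseconds, no holdoff means six interrupts, while a 100 microsecond holdoff means three. The price is that a lone frame waits up to the full holdoff before the CPU hears about it, which is why too large a value hurts ping times.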

What challenges does 10 Gb Ethernet support present?

David Gwynne: Our biggest challenge at the moment is finding developer time. We (always) have a lot of work to do but the time to do it is hard to find sometimes.

On a more technical level, supporting 10 Gb hardware is about as hard as it is to support any other new hardware. Someone has to sit down and figure the chip out and move data on and off it. That is possible using a driver from another operating system, but it is way easier if you have documentation. If you have documentation the driver is usually faster to develop and it always ends up being higher quality and more reliable. Fortunately a significant proportion of the vendors in the 10 Gb space are happy to provide documentation for their hardware.

Supporting the movement of packets through our network stack to the hardware at 10 Gb speeds is a problem we've always had. We've always wanted things to go faster, 10 Gb is just another level to strive for. Having said that though, moving packets in and out of boxes that fast makes problems more noticeable and therefore more attackable. 10 Gb hardware is also getting a lot smarter about how it moves packets from the chip to the computer itself. Some of those mechanisms are worth pursuing, others aren't.

One of the common mechanisms is offloading some or all of the work the network stack does onto the hardware. This ranges from offloading TCP segmentation (TSO and LSO) all the way up to full TCP Offload Engines (TOE). We actually like the OpenBSD network stack though, so we aren't going to implement support for this.

The other popular mechanism is to parallelize the movement of packets on and off the card (i.e., on "old" network cards you can only have one CPU operating on the hardware at a time, while a lot of 10 Gb cards provide multiple interfaces for this same activity, meaning you can have several CPUs moving packets on and off the chip at the same time). Supporting this obviously provides a big challenge to OpenBSD since we have the Big Giant Lock on SMP systems. Only one CPU can be running in the kernel at a time, so you can only have that one CPU dealing with the network card.

4.2 brings a new 10 Gb driver for Tehuti Network controllers (tht(4)), and a lot of improvements in the kernel and the network stack that help throughput. These improvements help all network cards though, not just 10 Gb ones.

What does this release offer to Wi-Fi users?

Jonathan Gray: In 4.2 the main additional wireless hardware support comes in the form of support for Marvell 88W8385 802.11g based Compact Flash devices in malo(4). This is of particular interest for Zaurus users wanting faster network I/O. Beyond that it was mostly 802.11 stack/driver bug fixes. Two new drivers recently hit the development branch: Damien Bergamini's iwn(4) driver for Intel 4965AGN Draft-N and a port of Sepherosa Ziehau's bwi(4) driver for Broadcom AirForce/AirPort Extreme devices from DragonFly. However, these were too late for 4.2 and will appear in 4.3.

Did you improve isakmpd interoperability?

Todd T. Fries: There are two important isakmpd(8) interoperability fixes new with the 4.2 release. One permits interoperability with other IKE implementations that re-key on UDP port 4500, instead of expecting re-keying to occur on port 500. The other permits key exchange with RSA signature authentication to work with Cisco IOS. Both expand on the wide range of IKE implementations isakmpd(8) is already interoperable with.

"Provide software HMAC for glxsb CPUs, so IPSec can use the crypto HW." How much does this feature improve performance concretely?

Markus Friedl: It improves IPsec performance on a 500 MHz machine from 17 Mbit/s to 30 Mbit/s with AES/SHA1 and PF enabled. This does not affect OpenSSH, since OpenSSH could use the hardware before this change.

Why have you replaced the random timestamps and ISN generation code for TCP with a RFC1948 based method?

Markus Friedl: Machines are getting faster and doing more TCP connections per second, so TCP client port reuse is getting much more likely. Both random timestamps and random ISNs make it hard for the TCP server to distinguish "old" TCP segments from new connections. Using RFC1948 based methods restores monotonic timestamps and ISNs for the same 4-tuple, making it possible for the TCP server to allow early port reuse.
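
RFC 1948 computes the initial sequence number as ISN = M + F(local address, local port, remote address, remote port, secret), where M is a steadily ticking clock and F is a keyed hash. The sketch below shows the shape of that computation only. It is not the OpenBSD code: the names conn_id, isn_offset and isn_rfc1948 are hypothetical, and FNV-1a is used as a placeholder hash purely to keep the example free of dependencies. It is not cryptographically secure, and a real implementation would use a proper keyed hash such as the MD5 construction the RFC suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Connection identifier: the 4-tuple mentioned in the text. */
struct conn_id {
    uint32_t laddr, raddr;      /* IPv4 addresses */
    uint16_t lport, rport;
};

/* FNV-1a over a byte range. NOT cryptographic: a placeholder for the
 * keyed hash so that the sketch stays self-contained. */
static uint32_t
fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* F(4-tuple, secret): constant for a given connection identifier. */
static uint32_t
isn_offset(const struct conn_id *id, const void *secret, size_t secret_len)
{
    uint32_t h = 2166136261u;

    h = fnv1a(h, &id->laddr, sizeof(id->laddr));
    h = fnv1a(h, &id->lport, sizeof(id->lport));
    h = fnv1a(h, &id->raddr, sizeof(id->raddr));
    h = fnv1a(h, &id->rport, sizeof(id->rport));
    return fnv1a(h, secret, secret_len);
}

/* ISN = M + F(...), where M (clock_ticks) advances steadily. */
static uint32_t
isn_rfc1948(uint32_t clock_ticks, const struct conn_id *id,
    const void *secret, size_t secret_len)
{
    assert(id != NULL && secret != NULL);
    return clock_ticks + isn_offset(id, secret, secret_len);
}
```

For a fixed 4-tuple the offset never changes, so successive ISNs advance exactly with the clock (modulo 2^32), which is the monotonic behaviour the server needs. An attacker who does not know the secret cannot predict the offset for some other 4-tuple.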

You fixed a really old bug in the socket code. Would you like to tell us more about it?

Dimitry Andric: This is actually something that I didn't find myself. I just happened to see this issue coming along on the FreeBSD CVS commit list. It's a very tricky macro bug, that has existed since the very first revision of the sblock() macro, and was apparently never noticed until recently.

It replaces this version of a macro in src/sys/sys/socketvar.h:

#define sblock(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK), 0)

with:

#define sblock(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK, 0))

Here sb is a pointer to struct sockbuf, and wf is an int ("waitfor").

The only difference is moving that next-to-last right parenthesis. But it changes the entire meaning of the macro! The original version will always return 0, since the ",0" is the last part of the complete expression. This was not what was intended: it should only directly return 0 in the case that sb didn't have its SB_LOCK flag set.

If you'd write this as a much clearer inline function, without the ?: operator, the original would become:

inline int sblock(struct sockbuf *sb, int wf)
{
        if (sb->sb_flags & SB_LOCK) {
                if (wf == M_WAITOK) {
                        (void) sb_lock(sb); // return value gets ignored
                } else {
                        (void) EWOULDBLOCK; // value gets ignored
                }
        } else {
                sb->sb_flags |= SB_LOCK;
        }
        return 0; // always succeeds! yeah right :)
}

while the fixed version would become:

inline int sblock(struct sockbuf *sb, int wf)
{
        if (sb->sb_flags & SB_LOCK) {
                if (wf == M_WAITOK) {
                        return sb_lock(sb);
                } else {
                        return EWOULDBLOCK;
                }
        } else {
                sb->sb_flags |= SB_LOCK;
                return 0;
        }
}

This is a good example of why the ?: operator should be used with caution, at least in complicated expressions with macros.
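
The behaviour is easy to reproduce in userland. The harness below is a hedged sketch: the two macros are copied from above and merely renamed sblock_old and sblock_fixed, while everything around them (the one-field struct sockbuf, the flag values, and a stub sb_lock() that pretends the sleep was interrupted) is a stand-in invented for the demonstration, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the kernel definitions; the values are arbitrary. */
#define SB_LOCK   0x01
#define M_WAITOK  0x01
#define M_NOWAIT  0x02

struct sockbuf {
    int sb_flags;
};

/* Stub: the real sb_lock() sleeps until the lock is free. */
static int
sb_lock(struct sockbuf *sb)
{
    (void)sb;
    return EINTR;               /* pretend the sleep was interrupted */
}

/* Original macro: ", 0" applies to the whole expression. */
#define sblock_old(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK), 0)

/* Fixed macro: ", 0" belongs to the "not locked" branch only. */
#define sblock_fixed(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK, 0))

/* Thin wrappers so each macro's result can be inspected as a value. */
static int
try_old(struct sockbuf *sb, int wf)
{
    assert(sb != NULL);
    return sblock_old(sb, wf);
}

static int
try_fixed(struct sockbuf *sb, int wf)
{
    assert(sb != NULL);
    return sblock_fixed(sb, wf);
}
```

With a socket buffer that is already locked, try_old() reports success (0) for both M_NOWAIT and M_WAITOK, while try_fixed() returns EWOULDBLOCK and the stub's EINTR respectively. For an unlocked buffer both versions set SB_LOCK and return 0, which is why the bug went unnoticed in the uncontended case.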

IP balancing support has been added to carp(4). Can you explain in which scenarios it can be used and how it works?

Marco Pfatschbacher: IP balancing provides the same functionality as ARP load balancing, but without its limitations. It can also share traffic that comes across routers and it works with IPv6.

IP balancing can be used to build highly available load balanced systems for servers, VPN gateways, firewalls, or to make OpenBSD based load balancers load balance themselves.

The basic concept is that we use the shared medium properties of the Ethernet to direct the traffic to all carp nodes. Each one hashes the source and destination IP at ip_input() and decides based on the status of the carp load balancing group whether a packet should be accepted or just dropped silently.
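
The accept-or-drop decision can be illustrated with a toy model. This is an assumption-laden sketch rather than the carp(4) implementation: pair_hash() is an arbitrary mixing function chosen for the example, balance_accept() is a hypothetical name, and the group state is reduced to "node number self out of nnodes active nodes".

```c
#include <assert.h>
#include <stdint.h>

/* Toy hash of the address pair; carp(4) uses its own function. */
static uint32_t
pair_hash(uint32_t src, uint32_t dst)
{
    uint32_t h = src ^ (dst * 2654435761u);

    h ^= h >> 16;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

/*
 * Every node sees every packet. Node `self` (0-based) out of `nnodes`
 * active nodes accepts the packet only if the hash maps to it.
 * Returns 1 to accept, 0 to drop silently.
 */
static int
balance_accept(uint32_t src, uint32_t dst, unsigned int self,
    unsigned int nnodes)
{
    assert(nnodes > 0 && self < nnodes);
    return pair_hash(src, dst) % nnodes == self;
}
```

Since all nodes compute the same hash over the same addresses, exactly one of them accepts any given source and destination pair, and all packets of one connection land on the same node without the nodes exchanging per-packet information.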

The scalability mainly depends on the ratio between network traffic and the load that it causes on the server side. This is because each node has to cope with the incoming traffic of the entire cluster up to ip_input(). For example, IP balancing will not help as much to scale plain routers. A CVS server, however, could be perfectly scaled up to 8 or more nodes.

Currently IP balancing is a little complicated to configure, since each load balancing group has to be built out of multiple carp interfaces. I'm working on a change to integrate the load balancing into a single carp interface.

You also fixed a problem that affected carp...

Henning Brauer: I basically fixed a bug I found the hard way... on a production router. :(

To add/change/delete/get a route, you send a message on the routing socket. That message is echoed to all open routing sockets. So by opening a routing socket and listening to the messages, you can keep track of all changes to the routing tables; the routing daemons use that to keep their views of the kernel routing table in sync, and "route monitor" allows you to see these messages in realtime. When something inside the kernel changes the routing table, it must make sure a message indicating so is generated and sent on the routing sockets. Everything does. Carp did not. Now it does. :)

Carp plays with routes for the IP addresses on the carp interface when changing from master to backup and vice versa. The effect of the missing routing messages was that when the bgpd process was started and the carp interface was in backup state, bgpd got the wrong nexthop for connections over the carp interface, and thus, once a failover happened, bgpd blackholed traffic that was supposed to go over the carp interface.

There are a lot of new features in hoststated(8), would you like to describe the most interesting ones?

Pierre-Yves Ritschard: Hoststated has had a lot of improvements and new features between 4.1 and 4.2.

First and foremost, hoststated now has layer 7 support, which means it is not only able to load balance at the packet level (layer 3), but at the application level. Our layer 7 support includes HTTP SSL termination, generic SSL termination, HTTP header manipulation and more.

Hoststated is also now able to gracefully reload for layer 3 configurations, while layer 7 configuration reload will follow shortly. Additional reporting has been added and hoststatectl can now show host reliability.

As always, we've done our best to provide a clean and consistent configuration syntax, and have more plans to improve it for the next releases!

ftp-proxy(8) is now able to automatically tag packets passing through the pf(4) rule with a supplied name. How does it work?

Henning Brauer: Well, it is really simple. ftp-proxy just adds "tag foo" to the rules it inserts for tracking the data connection, where foo is the name you supplied. That makes it way easier later on to match packets which were handled by these ftp-proxy-inserted rules, be it for filtering or queueing or whatever else pf allows.

What's changed in ftp(1)?

Pierre-Yves Ritschard: There are three new things to note in ftp(1). First, it is now able to go through HTTP proxies requiring passwords, just like it was able to send a password to an FTP server. Many environments provide access through authenticated proxies, and it's nice to be able to still use ftp(1) and pkg_add(1) there.

ftp(1) is now also able to parse Netscape-like cookie jars, the format used by all Netscape and Mozilla browsers. While it won't store cookies, it will allow you to read the ones created by your everyday browser. This is especially useful when you need to download a file which requires HTTP authentication through cookies but do not want to rely on your browser's download manager.

Last, Marc Espie provided a way to keep FTP control connections alive even in environments where the TCP session ripping is overly aggressive. ftp(1) can now send NOOP commands every once in a while to maintain a flow of data on the control connection and keep it from being timed out before a transfer on a data channel is done. This will keep users of pkg_add who rely on an FTP server from seeing timeouts when downloading big packages.

What's new in the ports framework and in pkg_* tools?

Marc Espie: Users shouldn't notice much, but a lot of stuff has changed internally. There's been a large number of internal clean-ups in pkg_add, and a few changes to related tools such as ftp(1). Some of them are not really very visible, since they're mostly preparation for further things to come.

The most useful change is probably the addition of FTP_KEEPALIVE. People who live behind a firewall that drops connections are going to love this. If you set FTP_KEEPALIVE to a duration (say 60 seconds), then ftp will try to ensure an inactive connection doesn't get dropped. This makes a big difference to the reliability of pkg_add over ftp:// urls!

The second interesting change is that pkg_add now stops at the first location in the PKG_PATH that has suitable candidates for addition/updates. Thus, you can now fill up PKG_PATH with suitable back-up mirrors, and not have to confirm each choice through 10 package candidates.

A few minor issues were fixed as well. pkg_add will yield better diagnostics when, for instance, it can't find a library that matches a dependency. It will also deal with some fringe cases better... hopefully, you won't notice any of this. It just means updates will work transparently.

We now have enough experience to say pkg_add -u rocks. It works as advertised, and more, and should be able to let you update your system through two or more releases.

As far as ports go, there's more stuff, as usual. And it works better. Most software has been updated to newer versions. If, by any chance, you still build stuff from source (though discouraged), you'll notice that distfiles checksums are now using SHA256, to satisfy paranoid people.

There haven't been a lot of changes to the ports infrastructure; it's obviously fairly stable these days. A few tweaks like STARTDIR have come in. You can use STARTDIR to start a build (or anything) at a given place in the ports tree.

As far as new ports go, the most noticeable one this release is probably apache2. We finally added it to help porting more stuff to apache1, the one true apache in the OpenBSD tree. Oh yes, and there's been a big gnome update.

In short, there's nothing really exciting for the end user. Internally, we're very happy to see things get ever more robust. We see fewer and fewer bugs in package building, while the number of ports still grows at the same rate.

I noticed that the Gnome desktop received an update after some years of inactivity. Can you tell something more about that?

Jasper Lievisse Adriaanse: When I was working on porting Workrave I noted that, first of all, Gtk and its dependencies were badly outdated. Once I had them updated, Workrave started complaining about a lot of missing C++ bindings for Gnome. At that moment I had some spare time from school and decided to give updating Gnome a go. In a rather short period of time martynas@ and I updated most of the Gnome ports. This was really needed, as most components of the Gnome desktop were at version 2.12. So, OpenBSD 4.2 ships with Gnome 2.18.

This is the first release that includes Xenocara, a port of XOrg 7.2. What are the differences with the past (XFree/XOrg 6.x) and what do you see as improvements/advantages?

Matthieu Herrb: There are not too many differences from the user point of view. The main difference is for developers or people interested in rebuilding some parts of X after applying a patch: X.Org 7.2 is now built in a modular way, each module using GNU autotools as its build system. Xenocara adds a BSD-style Makefile wrapper over this to drive the build in the correct order. In this new world it's no longer required to rebuild all of the X tree to recompile just a little driver or library. You can change to the directory holding the bits you want to recompile and run make -f Makefile.bsd-wrapper build to build and install it. I hope that will help get more developers involved with X.Org.

For users the main change will probably be that more video drivers now use the monitor information returned by DDC probes to auto-configure them. This means that more people can run X without bothering about a configuration file.

cwm(1) has replaced wm2 as a simple-looking low-resource window manager. Why?

Matthieu Herrb: Because we are interested in having such a window manager in the tree, with modern features. Unfortunately wm2 (not to be confused with wmii) is not actively developed anymore. Cwm matches some of the criteria and has attracted the attention of enough OpenBSD developers to provide a good ground to implement the missing features (mostly the NETWM protocol support, so that Gnome/KDE applications behave correctly) in the near future.

I saw this note in the changelog: "Bring in GLw from XF4 to xenocara to replace the Mesa version." Is it something you would like to talk about?

Matthieu Herrb: Not really. The version in the XFree86 tree is using some linker tricks to make it compatible with the Motif toolkit (in addition to Xaw), while Mesa has not picked up this code. To my knowledge, there is only one OpenBSD user who actually uses libGLw with Motif.

But this gives me the occasion to say that the OpenMotif port has been updated to version 2.3 in the ports. This is good news for people using Motif, since this version adds support for Xft fonts (client-side, anti-aliased). This may draw some more attention to the Motif toolkit, which unlike Gtk or Qt has been standardized by IEEE.

sendbug(1) has been rewritten, why?

Ray Lai: Largely maintainability. The old sendbug was a shell script that had a lot of problems, but nobody wanted to touch it because it was just a giant, convoluted shell script. I am fine with shell scripts, as long as they remain small. Once they grow to the size of sendbug, they are really difficult to maintain. Rewriting it in C made it a lot easier to deal with, since the C environment is a lot more controlled and you don't need to worry about weird environment variables, filenames, or editors causing behavior changes.

While rewriting sendbug, Theo and I discovered that there were a lot of functions dedicated to calling an editor and waiting until the editor closed. It sounds like a simple function, but almost every implementation had bugs in it. So in sendbug I tried to write that function correctly, getting feedback from Theo and Todd Miller. I then copied that function to the other implementations, to eliminate any bugs introduced in variations of this code.
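
To show why such a function is easy to get wrong, here is a minimal sketch of the general technique using standard POSIX calls. It is not the function that went into sendbug: run_editor() is a hypothetical name, the fixed-size command buffer and the lack of quoting for paths containing spaces are simplifications, and the points it tries to illustrate are the ones that are commonly missed (ignoring SIGINT and SIGQUIT in the parent while the editor owns the terminal, retrying waitpid() on EINTR, and restoring the signal handlers on every exit path).

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run "<editor> <path>" through /bin/sh and wait for it to finish.
 * Returns the editor's exit status, or -1 on any failure.
 */
static int
run_editor(const char *editor, const char *path)
{
    char cmd[1024];
    void (*oint)(int), (*oquit)(int);
    pid_t pid;
    int status, n, ret = -1;

    assert(editor != NULL && path != NULL);
    n = snprintf(cmd, sizeof(cmd), "%s %s", editor, path);
    if (n < 0 || (size_t)n >= sizeof(cmd))
        return -1;              /* command would have been truncated */

    /* The editor owns the terminal; keep ^C and ^\ from killing us. */
    oint = signal(SIGINT, SIG_IGN);
    oquit = signal(SIGQUIT, SIG_IGN);

    pid = fork();
    if (pid == 0) {
        /* Child: put the caller's signal handling back, then exec. */
        signal(SIGINT, oint);
        signal(SIGQUIT, oquit);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);             /* exec failed */
    }
    if (pid > 0) {
        /* Parent: retry waitpid() if a signal interrupts it. */
        while (waitpid(pid, &status, 0) == -1) {
            if (errno != EINTR)
                goto done;
        }
        if (WIFEXITED(status))
            ret = WEXITSTATUS(status);
    }
done:
    signal(SIGINT, oint);
    signal(SIGQUIT, oquit);
    return ret;
}
```

A caller would typically pass the value of the EDITOR environment variable, falling back to a default editor when it is unset, and treat any non-zero result as "do not trust the edited file".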

Any change in the way OpenBSD handles bug reports?

Ray Lai: No. Sendbug is only used for sending bug reports; the backend that handles receiving reports remains the same. However, we did add some details about the users' systems to the bug reports themselves, such as the dmesg. People forget to add their system's dmesg to reports all the time, even when it is relevant. This saves some work for the reporter, now that it is done automatically.

Did you see the paper presented by Robert Watson at USENIX WOOT07? I am wondering what users of OpenBSD 4.2 should do about systrace and sudo.

Todd C. Miller: Robert contacted me when he was writing his paper and I reviewed an early draft. There seems to be some confusion with respect to sudo and systrace. The paper describes an experimental version of sudo that was enhanced to use the systrace device directly for the purpose of intercepting the execve system call. This code does not exist in any released version of sudo, though it can still be found in the sudo cvs repository. I had intended this to be part of sudo 1.7 but abandoned work on it when it became apparent that a user could work around the restrictions. If systrace were to be modified to use a look-aside buffer for the kernel arguments I may revisit the sudo systrace support.

Browsing the changelog I found "Fix a 10-year old bug in make(1) which was causing it to spin if given large -j values." Does this mean that we can finally use -j when building the src tree? And what about ports? Did you run any benchmark?

Constantine A. Murenin: It was a pointer arithmetic bug that was corrupting internal data structures of make(1). It all started with my hardware upgrade, when I decided to use make(1)'s -j option for compiling the kernel. I soon noticed how unreliable it was: when a high value was given to -j, say 16 or 24, make(1) would often stop building anything and start consuming one hundred percent of one of the CPUs until ^C.

I then turned on some of our malloc.conf(5) options, and was able to reliably crash make(1) on a regular basis. The debugging revealed multiple problems with the memory allocation code in make/job.c, all of which have now been fixed. Additional details are available in an undeadly article.

I did do some benchmarking on building the kernel with various -j options, and results are available in my LiveJournal. In short: the difference with building the kernel is quite substantial, and there are no stalls anymore. As for the userland and ports, espie@ is now working on making make(1) do a better job there—stay tuned for the next release.

Big changes to libc and libpthread, various fixes and cleanups. Would you like to tell us more?

Kurt Miller: This release I focused on code cleanups in libpthread which spilled into libc and librthread a bit. Initially I worked on some basic cleanup of libpthread like dead code removal and data type corrections noted by lint. After that I worked on removal of libpthread specific file descriptor locking from libc. I also added non-static mutex support to libc to address thread-safety in the directory operations functions (opendir(), readdir(), etc) in a way that both thread libraries could support.

The end result is that libc is now thread library agnostic or in other words libc's locking needs can be supported by either libpthread or librthread. It also sets the stage for the removal of libpthread from the system when rthreads is finished.

What is the status of your rthread implementation? I saw these: (1) Fixes in the signal handling code when waking up. This fixes the majority of the rthreads lockings and hangups. (2) Provide hook in ld.so(1) so rthreads can spinlock to protect from races.

Ted Unangst: rthreads are still incomplete, but there's slow progress being made. Because rthreads provide real preemptive multithreading, code that previously worked without locking doesn't. ld.so tries to resolve symbols on the fly, but it needs to be careful about the condition when two threads are both resolving symbols at the same time.

"Kernel work queues, workq_add_task(9), workq_create(9), workq_destroy(9) provides a mechanism to defer tasks to a process context when it is impossible to run such a task in the current context." Translation?

Ted Unangst: Many times, a device driver will receive an interrupt from the hardware, and in response to the interrupt do some work. However, interrupt handlers aren't a good place to do work. The whole kernel is locked up, so to speak, if the work requires completing some blocking action. Previously, drivers would deal with this by creating a kernel thread. The interrupt handler adds a task to a queue and wakes up the thread. Later, the thread can take as long as necessary to complete the task. But this means every driver needs its own thread.

workq is a generic version of that code, so that each device driver can benefit from a more complete implementation.
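
The pattern can be modelled in userland with a plain FIFO of function pointers. This is a sketch of the idea only and makes no claim about the workq(9) interface: the names work_queue, wq_init, wq_add, wq_run and count_task are hypothetical, there is no locking, and malloc() stands in for whatever allocator an interrupt-safe implementation would use.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*task_fn)(void *);

struct task {
    task_fn fn;
    void *arg;
    struct task *next;
};

/* Hypothetical userland stand-in for a kernel work queue. */
struct work_queue {
    struct task *head;
    struct task **tail;         /* points at the last "next" pointer */
};

static void
wq_init(struct work_queue *wq)
{
    wq->head = NULL;
    wq->tail = &wq->head;
}

/* Called from the "interrupt handler": record the work and return. */
static int
wq_add(struct work_queue *wq, task_fn fn, void *arg)
{
    struct task *t;

    assert(wq != NULL && fn != NULL);
    t = malloc(sizeof(*t));
    if (t == NULL)
        return -1;
    t->fn = fn;
    t->arg = arg;
    t->next = NULL;
    *wq->tail = t;
    wq->tail = &t->next;
    return 0;
}

/* Called from the shared worker context, where blocking is allowed.
 * Runs queued tasks in FIFO order and returns how many were run. */
static unsigned int
wq_run(struct work_queue *wq)
{
    unsigned int ran = 0;
    struct task *t;

    while ((t = wq->head) != NULL) {
        wq->head = t->next;
        if (wq->head == NULL)
            wq->tail = &wq->head;
        t->fn(t->arg);
        free(t);
        ran++;
    }
    return ran;
}

/* Example task: bump the counter it is handed. */
static void
count_task(void *arg)
{
    ++*(int *)arg;
}
```

Nothing runs at wq_add() time; the tasks execute only when the worker calls wq_run(). Sharing one such queue and worker among many drivers is what saves each driver from carrying its own thread.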

"Removed unused strcpy and strcat calls from kernel." Why?

Artur Grabowski: Dead code. Not used. Removed. We actually shrunk the kernel by a lot this release by removing functions that nothing was using. I wrote some half magic script that pulled out all the symbols from the kernel object files then pulled out all the used symbols and we just went through the kernel with a big axe killing everything that wasn't used. It even found real bugs (functions that should have been used, but weren't because of some ifdef typos).
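The core of such a script is a set difference: everything that is defined, minus everything that is referenced. The sketch below assumes the two symbol lists have already been extracted from the object files (the interview does not say how that was done), and the sample symbol names are illustrative only.

```python
def unused_symbols(defined, referenced):
    """Return symbols that are defined somewhere but referenced nowhere."""
    return sorted(set(defined) - set(referenced))


# Hypothetical input; a real run would be fed from the kernel object files.
defined = {"strcpy", "strcat", "strlcpy", "pmap_enter"}
referenced = {"strlcpy", "pmap_enter"}

candidates = unused_symbols(defined, referenced)
print(candidates)
```

Each candidate still needs a human look: as Artur notes, some "unused" functions turned out to be ones that should have been called.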

Move i386 to new timecounter code. Again?!

Artur Grabowski: Last release was amd64. We've been doing more architectures now. Still not all done, but the ones that could benefit the most from it have been done this release.

What is the story about the Y2K hack in date(1) that you just removed?

Todd C. Miller: As we were approaching the year 2000 a number of hacks were put in place to attempt to deal with ambiguous dates where the century was not specified. For instance, in 1997 a year specified by 02 could be either 1902 if we assume the current century or 2002 if we assume the following one. Now that we are well into the 21st century there is no need for such hacks. Hopefully, by the time the 22nd century approaches people will have gotten used to the idea of four digit years.

I saw various fixes for code handling i386 CPUs... for example: (1) Fixes in the vframe handling for i386 trap code. (2) Fix in the i386 pmap code for a possible AMD bug, which slightly speeds up TLB misses. (3) Fix for Intel errata AI91 in the i386 pmap handler code. (4) i386 TLB handling improved to avoid possible corruption on Core2Duo processors. Would you like to give us an idea of how you handle hardware errata (especially CPU errata) and implement workarounds?

Artur Grabowski: Lots of this is related to just hacking I've been doing in the pmap to make it faster on SMP. And then we started finding all those bugs in there. About the same time people got new machines that showed very strange crashes related to the pmap. We started chasing things and found that the bug we've been seeing, mainly on Core 2 machines, could in many cases be explained by CPU errata. It wasn't fixed by that but we learned a lot of new things about how the MMU and caches were working in the hairy details and fixed problems with that. The bug is not fixed yet, it's just hidden very well and we still don't know what causes it, but at least we learned a lot and fixed a lot of related things.

Reworking the TLB shootdown code for i386 and amd64 gave you a good speed improvement. How was the situation and what type of changes did you make?

Artur Grabowski: The old framework we had for TLB shootdowns (to keep the MMUs on different CPUs in sync) was very complicated. It used locks and slow-path IPI (inter-processor interrupt) handling, it had weird race conditions that could cause an IPI handler to wait for the biglock (very bad), and it simply took a lot of work to do such a seemingly simple thing.

I replaced it by something that I jokingly say takes about as many instructions per shootdown as the old code had function calls (it's not exactly true, but it's closer to the truth than one can imagine). Instead of having a huge infrastructure for doing smart guesses about when we need to do TLB shootdowns and when we can avoid them, we now just shoot much more often, but each shootdown costs almost nothing compared to before. If I recall correctly this cut down the time to do a full package build by 20 percent (mostly because all the forks in configure scripts have become much, much cheaper).

CPU frequency and voltage can now be scaled on all CPUs when running GENERIC.MP on a multiprocessor i386 or AMD64 machine with Enhanced Speedstep or Powernow. How much power can we save when using the battery?

Gordon Willem Klok: Potentially a great deal of power. Before the release of 4.1 I disabled many of the hw.setperf methods (such as enhanced speedstep and powernow) in multiprocessor kernels. This was necessary because without being SMP aware, twiddling hw.setperf either manually, or by apmd on your behalf, was essentially playing Russian roulette: whichever processor sysctl or apmd ran on would be the only one that the transition would be attempted on, and given the nature of speedstep and powernow in the current form, likely nothing would happen. So if you were using a multiprocessor kernel you had no opportunity to save power, while with these methods there is the potential to save a great deal of power and generate less heat, run quieter, etc.

I collected some decidedly unscientific results using my Thinkpad x60 with the battery removed, measuring the draw from the wall with a kilowatt meter. With no power management in use my laptop draws about 28 watts when idle and as much as 49 watts with both cores going full tilt. With hw.setperf set to zero (a core frequency of 1 GHz versus the full speed of 2 GHz), this laptop draws about 22 watts, saving about 6 watts or about 18 percent. What is even more interesting is that when going full tilt at 1 GHz, the peak draw is only 28 watts, like the idle draw. Translating this into increased runtime is tricky and I didn't have time to conduct proper tests, but if we assume that an 18 percent power saving translates into a similar increase in runtime, assuming a runtime of 5 hours (not unlikely for an x60) at full speed, you can buy yourself almost another hour by running at the lowest hw.setperf setting.
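The back-of-the-envelope estimate can be written out as follows. It uses only the figures quoted above; since the wattages are rounded, unscientific readings, the ratio computed from them comes out a little higher than the quoted "about 18 percent". The runtime line follows Gordon's own simplifying assumption that the saving carries over one-to-one into runtime.

```python
idle_full_speed_w = 28.0   # idle draw with no power management (quoted)
idle_lowest_w = 22.0       # idle draw with hw.setperf set to zero (quoted)
quoted_saving = 0.18       # "about 18 percent", as stated in the interview
runtime_hours = 5.0        # assumed runtime at full speed (quoted)

saved_w = idle_full_speed_w - idle_lowest_w
ratio_from_readings = saved_w / idle_full_speed_w

# Simplifying assumption from the interview: the runtime gain equals
# the power saving.
extra_hours = runtime_hours * quoted_saving

print(f"saved: {saved_w:.0f} W")
print(f"ratio from the rounded readings: {ratio_from_readings:.0%}")
print(f"extra runtime at the quoted saving: {extra_hours:.1f} h")
```

Either way the result lands close to the "almost another hour" mentioned above.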

Did you update the hw.setperf sysctl too?

Gordon Willem Klok: It was not necessary to change the hw.setperf sysctl at this time to accomplish multiprocessor aware frequency and voltage scaling. The hw.setperf code that first handles a request is fairly simple: it checks the arguments discarding values that are out of range (less than zero or greater than 100) and calls a function pointer that points at the routine that actually performs the transitions.

What I did was add a function that, after the underlying hw.setperf mechanism has been set up, stores a pointer to this function and substitutes its own. When hw.setperf is adjusted the mp_setperf mechanism executes the underlying mechanism on all processors on the system. Going forward the mechanism will likely need to be altered or the design philosophy of hw.setperf changed. At the very least, AMD is moving to a model where every core in a system can be running at a different operating frequency, and this will require some rethinking.
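The save-and-substitute trick is easy to model. The sketch below is a toy in Python rather than kernel code: the dictionaries standing in for the function pointer and for "the CPU we are running on", the two-CPU count, and the helper names are all assumptions made for illustration. In the kernel, the wrapper would have each processor run the underlying routine itself; here a loop fakes that.

```python
NCPUS = 2                      # assumed CPU count for the model
cpu_perf = [100] * NCPUS       # modelled per-CPU performance level, 0..100
state = {"current_cpu": 0}     # the CPU the caller happens to run on
hooks = {}                     # stands in for the kernel function pointer


def est_setperf(level):
    # Stand-in for a per-CPU mechanism: it only affects the CPU it runs on.
    cpu_perf[state["current_cpu"]] = level


hooks["setperf"] = est_setperf


def mp_setperf_init():
    """Remember the underlying routine and substitute an MP-aware wrapper."""
    underlying = hooks["setperf"]

    def mp_setperf(level):
        saved = state["current_cpu"]
        for cpu in range(NCPUS):
            state["current_cpu"] = cpu   # models running on that CPU
            underlying(level)
        state["current_cpu"] = saved

    hooks["setperf"] = mp_setperf


def sysctl_hw_setperf(level):
    # Discard out-of-range values, then call through the pointer.
    if level < 0 or level > 100:
        return False
    hooks["setperf"](level)
    return True


sysctl_hw_setperf(50)
before = list(cpu_perf)        # only CPU 0 changed
mp_setperf_init()
sysctl_hw_setperf(0)
after = list(cpu_perf)         # every CPU changed
rejected = sysctl_hw_setperf(150)
print(before, after, rejected)
```

The handler itself never changes, which matches the point that the sysctl did not need to be touched.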

How does apmd(8) manage CPU throttling on MP systems?

Gordon Willem Klok: As the hw.setperf mechanism retained the same semantics as far as the userland interface was concerned, Nikolay Sturm (sturm@) and a fellow by the name of Simon Effenberg only had to tweak apmd slightly to handle the MP case. Instead of having apmd consider only the idle time of a single processor, it looks at the average idle time of all the CPUs in a system and makes its transition decisions accordingly.
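To see why averaging matters, here is a small sketch of a decision function driven by per-CPU idle percentages. The thresholds and step size are invented for the example and are not apmd's actual values; only the idea of averaging over all CPUs comes from the interview.

```python
def average_idle(idle_per_cpu):
    """Mean idle percentage across all CPUs."""
    return sum(idle_per_cpu) / len(idle_per_cpu)


def next_setperf(current, idle_per_cpu, low_idle=10.0, high_idle=50.0, step=25):
    # low_idle, high_idle and step are hypothetical tuning values.
    idle = average_idle(idle_per_cpu)
    if idle < low_idle:
        return min(100, current + step)   # busy: speed up
    if idle > high_idle:
        return max(0, current - step)     # mostly idle: slow down
    return current


single_cpu_view = next_setperf(50, [5.0])        # looks only at a busy CPU 0
all_cpus_view = next_setperf(50, [5.0, 95.0])    # the other CPU is idle
mostly_idle = next_setperf(50, [90.0, 80.0])
print(single_cpu_view, all_cpus_view, mostly_idle)
```

Looking at one busy processor alone would push the speed up, while the system-wide average in this example leaves it where it is.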

It seems that some aspects of sensorsd have been redesigned to be more user-friendly. Could you give us an overview of these changes?

Constantine A. Murenin: sensorsd was originally written when the sensors framework didn't support as many features as it does today. With OpenBSD 4.2, sensorsd is revamped to be more in touch with the recent and not-so-recent features of the framework.

For example, in sensorsd.conf(5) we now support matching by sensor type, so that a single rule can be written to apply to all temperature sensors (e.g., "temp:low=15C:high=65C").

People with server-grade hardware will be happy to know that sensorsd now requires zero configuration in order to report on the status changes of smart sensors, meaning those that automatically provide sensor status themselves, like IPMI or bio(4)-based sensors, as well as all timedelta sensors. All that's needed to configure sensorsd in such cases is to append sensorsd_flags="" to /etc/rc.conf.local.

Some improvements were made for monitoring consumer-grade sensors, too. For example, if you have an lm(4)-based Winbond chip that does fan-speed controlling, then previously you might have noticed that sensorsd was totally ignoring certain fanrpm sensors if they were marked as invalid at the time when sensorsd was started. (The reason they might have been marked as invalid in the first place is that physical sensors in certain fans don't produce valid readings if the voltage is too low, even if the fan itself is still spinning.) With the 4.2 sensorsd, if you ask it to monitor a sensor that is periodically marked as invalid, then it will be reported as such, and value-based monitoring of such a sensor will resume as soon as the invalid flag is reset by the driver.

Some other related features and cleanups went along the way, including usability improvements in the log format, ability to set manual boundaries for any kind of sensor, outstanding documentation updates and overall polishing.

I saw that you retired the cats platform, removed support for 80386 processors in the i386 platform code, and at the same time you are adding support for additional models of hppa and alpha systems. I think these are niche platforms, so I am wondering how you choose what is worth your time?

Bob Beck: People choose what is worth their time based on what they want to work on, and what improves the project.

We have active developers with hppa and alpha systems, and not to be forgotten, supporting these architectures not only renders easily available old hardware useful, but also helps us keep our code quality in general up. All the world not being a 32-bit Intel machine...

cats on the other hand, isn't easily available, and nobody wants to support it. There are faster arm platforms we do support that we focus our attention on.

80386 is different, it's not a separate arch, but rather a level of support in the i386 arch. It was clear nobody had actually tested OpenBSD on an 80386 system in a number of years, and none of us were willing to do so. Given that a genuine 80386 probably wouldn't run for a lot of other reasons (memory, etc.) it didn't make sense to maintain the lowest common denominator support for 80386 when it gets in the way of doing other things in the kernel.

Artur Grabowski: By what people want to work on. As far as I know, cats was so annoying that people hated the machines and wanted them to die (which they apparently did, catching fire was apparently popular). While 386 was mostly broken anyway, had a lot of baggage cluttering code we were working on and no one wanted to make sure it continued to work.

Sun is finally sharing some docs. Did you take advantage of them to support PCIe UltraSPARC IIIi machines like the V215 and V245?

Mark Kettenis: It's really great that Sun is releasing more documentation for their hardware now. This will benefit all open source operating systems running on Sun UltraSPARC hardware, but one look at their wiki makes it immediately clear that OpenBSD played a major role in getting Sun to publish these documents.

Unfortunately that documentation wasn't available when I wrote the pyro(4) driver that was needed to support the V215 and V245. Instead I had to read lots of OpenSolaris code, and fill in some blanks myself. Because I didn't have the documentation available, the work took longer than necessary, and some interesting hardware features are missing from the driver.

I hope to revisit pyro(4) soon now that the docs are out there. But lately I've been rather busy with another feature that will make some sparc64 users happy (and for which the currently released docs are also a big help).

What is the Advanced Host Controller Interface?

David Gwynne: It is a specification that describes the interface a SATA controller should present to the operating system. It is similar to the PCI IDE controller specification in that many different vendors may have different chips all presenting a common interface, which are all supported by the one driver in an operating system. AHCI is the same idea, it just supports a different class of device, namely SATA, while the PCI IDE specification only dealt with IDE devices or devices that worked in an IDE compatible way. AHCI can be considered necessary since the SATA specification provides some advanced features (e.g., hotplug, bus expanders, and command queuing) that cannot fit into the existing PCI IDE interface.

There was a lot of work on the ahci(4) driver to get native support for some SATA controllers instead of going over pciide(4). What differences and advantages can we expect to see?

David Gwynne: Because the interface AHCI presents to the OS is so different to the one the PCI IDE specifies, it makes sense to have a separate and native driver for it. wdc(4), pciide(4) and ahci(4) can be considered equivalent because they provide the same functionality, namely taking commands from the operating system and putting them on the ATA devices that are hooked up to them. The difference between ahci(4) and the pciide(4) and wdc(4) drivers is where they get those commands from.

pciide(4) and wdc(4) both take their commands from wd(4) and atapiscsi(4), which are drivers that natively talk ATA commands. These four drivers are all that there is to support all the IDE controllers, and they're very tightly woven together. They were written a long time before SATA and some of its new features were ever considered, and because of this they lack the capability to support it.

On the other hand, some of the features that SATA offers sound an awful lot like what SCSI has had for years, and which our SCSI midlayer has been doing as a matter of course in that same time. Things like hotplug and command queueing are things that just work in SCSI land.

So I made the decision that rather than spending months refactoring the IDE code and potentially breaking support for everyone's IDE hardware (which is a lot of people), I would write a SCSI to ATA translation layer aptly called atascsi. It sits between the SCSI midlayer (which is basically the scsibus(4) device driver) and the ATA controller that uses it and just turns SCSI commands into ATA commands. The rest of the semantics such as command queueing and so on are all handled by the existing infrastructure in the midlayer.

The other advantage of atascsi is that it can be reused on other ATA controllers. In this release both ahci(4) and sili(4) for Silicon Image 3124/3132/3531 controllers use atascsi. Also because of atascsi, all the devices on these controllers appear as SCSI devices, i.e., sd(4) will attach to disks instead of wd(4).
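As a rough illustration of what "turning SCSI commands into ATA commands" involves, the sketch below decodes a 10-byte SCSI READ(10) or WRITE(10) command block and maps it to the corresponding 48-bit ATA DMA command. The opcodes are the standard ones from the SCSI and ATA command sets, but the function, the dictionary-shaped result, and the choice of which commands to map are assumptions made for this example; this is not the atascsi code.

```python
import struct

# Standard opcodes from the SCSI and ATA command sets.
SCSI_READ_10 = 0x28
SCSI_WRITE_10 = 0x2A
ATA_READ_DMA_EXT = 0x25
ATA_WRITE_DMA_EXT = 0x35

SCSI_TO_ATA = {
    SCSI_READ_10: ATA_READ_DMA_EXT,
    SCSI_WRITE_10: ATA_WRITE_DMA_EXT,
}


def translate(cdb):
    """Translate a READ(10)/WRITE(10) CDB into a simplified ATA request."""
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Layout: opcode, flags, 32-bit LBA, group, 16-bit length, control,
    # all big-endian.
    opcode, _flags, lba, _group, count, _control = struct.unpack(">BBIBHB", cdb)
    if opcode not in SCSI_TO_ATA:
        raise ValueError("unsupported SCSI opcode 0x%02x" % opcode)
    # A real layer must also handle edge cases such as a zero transfer
    # length, which the two command sets interpret differently.
    return {"command": SCSI_TO_ATA[opcode], "lba": lba, "sectors": count}


read_cdb = struct.pack(">BBIBHB", SCSI_READ_10, 0, 2048, 0, 8, 0)
request = translate(read_cdb)
print(request)
```

Everything above the translation step (queueing, hotplug, attaching sd(4)) stays in the SCSI midlayer, which is the reuse David is describing.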

You have included FFS2. What features does it provide?

Otto Moerbeek: The two most important benefits FFS2 provides are support for large (greater than 1 TB) filesystems, and much much quicker newfs(8) times. The code is mostly taken from FreeBSD with some parts from NetBSD. The on disk layout is largely the same, but we did not test if existing file systems can be interchanged with other BSDs. I know that NetBSD implements endian swapping for their filesystems, something we do not. So probably you'll see some differences there. Snapshot or background file system check is something we have not implemented yet. Userland utilities that manipulate on disk data structures directly, like dump(8), restore(8) and fsck_ffs(8) have been converted to understand the FFS2 format. Also, the disklabels are now capable of partitioning very large disks, up to 128 petabytes. Partitions can also be that large, in theory at least.

What is the default filesystem in OpenBSD 4.2 for a fresh installation?
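One way to arrive at a figure like 128 petabytes is simple sector arithmetic. The sketch below assumes 512-byte sectors and 48-bit sector numbers; the interview does not state the field widths, so treat both as assumptions. It also shows what 32-bit sector numbers would allow, for comparison.

```python
SECTOR_BYTES = 512   # assumed sector size
TIB = 2 ** 40
PIB = 2 ** 50


def max_bytes(sector_bits, sector_bytes=SECTOR_BYTES):
    """Largest addressable size for a given sector-number width."""
    return (2 ** sector_bits) * sector_bytes


limit_48bit_pib = max_bytes(48) // PIB
limit_32bit_tib = max_bytes(32) // TIB
print(limit_48bit_pib, "PiB with 48-bit sector numbers")
print(limit_32bit_tib, "TiB with 32-bit sector numbers")
```

Under these assumptions the 48-bit case gives exactly 128 binary petabytes.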

Otto Moerbeek: FFS1 remains the default filesystem for the foreseeable future. There are a couple of reasons for that: an important reason is that for small filesystems, FFS2 does not provide any benefit. Also, the boot code for the various platforms is not yet capable of understanding FFS2. So if you want to use an FFS2 filesystem, you'll need to create it using newfs -O2.

To convert an existing filesystem to FFS2, you'll need to tar, newfs -O2, and untar. But remember that the boot media do not support FFS2 yet, so filesystems containing the base system should remain FFS1.

I read that "some parts of the system are not 64-bit disk block clean yet, so partition larger than 2TB cannot be used at the moment." Is there anything users could do to help you extend the support?

Otto Moerbeek: Obviously by testing disks up to 2TB. Note that FFS1 can be used also on large disks, as long as the filesystem size stays below 1TB. A little warning: 1TB filesystems take a lot of time and memory to run fsck_ffs(8) on. Large block and fragment sizes can help solve that, at the cost of some wasted disk space. To make really large filesystems work in practice, a solution to the huge time and memory requirements for filesystem checking has to be implemented.

How is the work on bio(4) going on? It seems you have ported it to all the platforms!

Marco Peereboom: Bio(4) has been moving along. We have many more supported RAID cards. The one that is still glaringly missing is mpi(4). Dlg and I have both been telling each other "that we will do it soon" but neither of us has found time to do it.

Bio(4) is starting to show some limitations that we want to solve. Softraid(4) is pushing some limitations like "creating disks" and the general consensus is that something needs to be done; however, what "it" is has not been determined at this time.

This release comes with softraid(4) enabled in GENERIC so people can test. What can users do to help you? Send dmesg? Run any particular tool?

Marco Peereboom: Testing is always appreciated. I have received some pretty darn good test reports in the past and have been able to fix those bugs, so keep them coming.

softraid(4) is not enabled in GENERIC despite popular belief. Theo and I agree that it needs to do more before we can move forward and enable it. A second complication is that not all architectures are "ready" to run with softraid(4). We accomplished a lot during the last hackathon in moving in that direction but some older arches will need some love from the likes of Miod before softraid(4) can be enabled.

Theo and Tom have been doing some necessary groundwork to enable booting of softraid(4). I can't stress enough the crazy diffs Theo has been committing in the disklabel stuff. Tom on the other hand has been doing bootloader work as well. This is still not completed but it does bring us closer to a true booting softraid(4) implementation.

The glaring missing feature at this time is rebuilds. This feature is still brewing in my head. Surprisingly this is one of the most complex problems to solve in the softraid(4) stack. It is inherently racy and I don't want it to stand in the way of normal operations. Sitting at the boot prompt for several hours while rebuilding is unacceptable. Also unacceptable is calling something a background rebuild while the machine is essentially rendered useless due to performance issues. I also found out that people are attached to their data so maintaining data integrity is also high on the list :-) I have some ideas on how to solve this problem but have not made any serious attempts at implementing them yet.

What's also still missing are some additional disciplines like RAID 0, RAID 5, Concat etc. Other ideas that are floating are adding AOE (hi tedu!) and maybe iSCSI disciplines but that is further out.

What is the plan for the basic support for crypto(9) backed RAID in softraid(4)?

Marco Peereboom: Currently crypto(9) support is disabled for various reasons. The biggest one is that we have not figured out how to do key management yet. Tedu and I have been floating some ideas along the lines of keeping the key on a separate disk. For example, one can keep the key in some metadata on a USB key and only when the USB key remains inserted in the machine will softraid(4) decrypt. As soon as the USB key is pulled out softraid(4) would shut down the disk and make it unavailable. The main idea here is that the key is not physically part of the machine softraid(4) is running on, and when separated both are useless. There are various hurdles to overcome that are being thought through.

Also problematic at this time is that the crypto thread is not running when softraid(4) is loaded. Obviously this causes hangs during boot time because the decrypting job never finishes. I have been talking to Theo on how to solve this problem and various scenarios are being explored...

Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.



Copyright © 2009 O'Reilly Media, Inc.