n2k14 hackathon report: dlg@ on locking, midlayers, and network drivers

Contributed by tbert on 2014-02-14 from the only-run-firefox-on-your-web-servers dept.

David Gwynne (dlg@) tells us why he travelled all the way from Australia to come to New Zealand:

The only real plan I had made leading up to the hackathon was to do my best to move our SMP support forward. Despite that, I got distracted pretty soon after I turned up because of a discussion with krw@ about leftover work we had after the big restructure of the SCSI midlayer.

3 or 4 years ago I implemented a large change in the SCSI midlayer to provide better utilisation and scheduling of SCSI adapter resources by introducing an abstraction called "iopools". The previous midlayer relied on handling of a special command error code called XS_NO_CCB to provide back pressure to devices it couldn't do work for due to adapter resource shortages. iopools replace the XS_NO_CCB infrastructure with a scheduler that provides fair access to an adapter's command slots.
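The idea of fair access to a fixed set of command slots can be illustrated with a small sketch. This is not the OpenBSD iopool code or its API: the names `slot_pool`, `slot_waiter`, `pool_get` and `pool_put` are hypothetical, locking is left out, and the two-slot adapter is an assumption. It only shows the pattern of queueing requesters in arrival order and handing a returned slot straight to the oldest waiter, rather than failing the request and asking the caller to retry.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 2    /* hypothetical adapter with two command slots */

struct slot_waiter {
    struct slot_waiter *next;
    int slot;       /* granted slot number, or -1 while still queued */
};

struct slot_pool {
    int free_slots[NSLOTS];         /* stack of idle slot numbers */
    int nfree;
    struct slot_waiter *head;       /* oldest waiter */
    struct slot_waiter **tail;      /* where the next waiter is linked */
};

void
pool_init(struct slot_pool *p)
{
    for (int i = 0; i < NSLOTS; i++)
        p->free_slots[i] = i;
    p->nfree = NSLOTS;
    p->head = NULL;
    p->tail = &p->head;
}

/* Request a slot: granted at once if one is idle, otherwise queued FIFO. */
void
pool_get(struct slot_pool *p, struct slot_waiter *w)
{
    w->next = NULL;
    if (p->nfree > 0) {
        w->slot = p->free_slots[--p->nfree];
        return;
    }
    w->slot = -1;
    *p->tail = w;
    p->tail = &w->next;
}

/* Return a slot: the oldest waiter gets it before it can go idle. */
void
pool_put(struct slot_pool *p, int slot)
{
    struct slot_waiter *w = p->head;

    if (w != NULL) {
        p->head = w->next;
        if (p->head == NULL)
            p->tail = &p->head;
        w->slot = slot;     /* a real driver would start the command here */
        return;
    }
    assert(p->nfree < NSLOTS);
    p->free_slots[p->nfree++] = slot;
}
```

Because a freed slot goes to the head of the queue, a busy device cannot starve one that asked earlier, which is the back pressure the error-code approach had to approximate with retries.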
The intention has always been to go through and replace XS_NO_CCB with iopools, but that requires modification to pretty much every SCSI driver in the tree. Despite iopools being introduced 3 or 4 years ago there are still a number of drivers that hadn't been cut over, either due to a lack of hardware to test on or due to the complexity of the driver. I spent the first day and a half of the hackathon looking at these leftovers and trying to fix a few.
I worked through some of those leftover drivers, and took advantage of the hackathon to put the code into the tree early to make it easy for (or force, depending on your perspective) people to try the diffs. These drivers included those for rare hardware like aac, esp, pscp, si, asc, and wdsc, and stupidly complicated drivers like wdc and atapiscsi which required extremely subtle changes to the whole wdc/ata stack. I think I screwed up more than half of them, but I was able to deal with the fallout quickly thanks to being in front of a computer at the hackathon.
After looking at moving ahc(4) and ahd(4) to iopools I quickly remembered that I had other plans and turned to SMP things.
One of the biggest stumbling blocks to further work on SMP in our kernel has been an interaction between the big kernel lock and mutexes. The path forward in that environment was to take entire subsystems and layers out from under the big lock in one go. However, in practice that has obviously not happened in any significant way because some of these layers are quite large (eg, the block layer and network stacks) and need the hardware device drivers interacting with that layer to come with them. All of them. For example, we have approximately 140 network drivers across all our architectures that would have to be made biglock free before we could take the network stack out too. Moving only a portion of these layers out at a time guarantees we will experience deadlocks, which means we can't do it.
kettenis@ has recently implemented a change to mutexes that lets us sidestep those deadlocks, which in turn means we can get on with making evolutionary changes rather than revolutionary changes to improve our SMP scalability. I spent some days taking advantage of that by moving some drivers and layers I am familiar with toward fine grained locking, specifically myx(4), mpi(4), mpii(4), sd(4), the SCSI midlayer represented by scsibus(4), and some of the input side of the network stack.
That code didn't take long to write and it seemed to work, but it was hard to have any confidence in it or judge whether it improved anything or not.
One of the common concerns with moving forward with SMP is introducing performance regressions. If we have to take and give up locks (and the big lock in particular) then we could be adding more places to spin on the locks rather than actually doing more work without the lock held. The other problem with SMP is that it turns the kernel from something kind of like a single process with signal handlers into a heavily threaded program with shared memory and locks. Threaded programs are a lot harder to reason about and observe. It is especially difficult in the kernel because the existing debug infrastructure can affect the timing or locking of the kernel, which can often hide the problem or make it unusably slow.
To better measure and observe what the kernel was doing, I implemented a basic event tracing facility for the kernel. This tracing showed up two interesting things.
First, it showed that my SMP safe code wasn't actually being run without the biglock because the interrupt code on the sparc64 I was working on hadn't been updated to respect the MP safety flag for interrupt handlers. kettenis@ and I quickly resolved that problem and I started testing again.
Second, it showed that there were spots where the big kernel lock was being held for "very long" periods of time, which in my world (well, the code) is about 50 microseconds. It turns out this problem is not SMP specific and these pauses occur without my tweaked drivers. The two pauses I found were related to memory management.
One of the pauses was generated by the buffer layer and its management of the cache. When memory is available, the cache is supposed to sit between a high and low watermark. When the high watermark was about to be exceeded, it decided to free memory down to the low watermark. If you have a lot of memory in a machine the difference between the high and low watermark can be quite big, and it frees it one page at a time. The more memory you have the longer it will take.
I came up with a diff that restricted it to freeing only 8 pages at a time, and then deferring the rest of the frees so they could be interleaved with other work in the kernel. beck@ looked at the problem in a bit more depth and determined that removing the low watermark and only maintaining the high watermark was a better solution. That change is now in the tree.
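The batching idea in that first diff can be sketched roughly as follows. This is an illustration only and not the diff itself, and it is not the approach that was committed (that one dropped the low watermark instead). The `cache` structure, its fields and `cache_trim_step` are hypothetical names, and decrementing a counter stands in for releasing a real page.

```c
#include <assert.h>
#include <stddef.h>

#define FREE_BATCH 8    /* most pages released in one pass */

struct cache {
    size_t npages;      /* pages currently held by the cache */
    size_t low_water;   /* level the cache wants to shrink to */
};

/*
 * Release at most FREE_BATCH pages, then return how many pages are
 * still above the low watermark. A non-zero result tells the caller
 * to schedule another pass later instead of looping here.
 */
size_t
cache_trim_step(struct cache *c)
{
    size_t n = 0;

    assert(c != NULL);
    while (c->npages > c->low_water && n < FREE_BATCH) {
        c->npages--;    /* stand-in for freeing one page */
        n++;
    }
    return (c->npages > c->low_water ? c->npages - c->low_water : 0);
}
```

Each call does a bounded amount of work, so the time spent with the lock held no longer grows with the gap between the watermarks; the cost is that the cache takes several passes to reach its target.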
The other pause comes from large processes exiting and having their memory returned to the memory management layer, also one page at a time. There's still ongoing discussion about how to deal with that properly. In the meantime, don't use Firefox on your firewalls.
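As a rough picture of how event tracing can expose long lock hold times like the two above, here is a minimal sketch. It is not the tracing facility described in this report: the names `trace_buf`, `trace_record` and `trace_long_holds` are hypothetical, timestamps are passed in by the caller rather than read from a clock, and it assumes a single buffer with no concurrent writers (a real kernel would more likely need one buffer per CPU).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 64  /* ring size; old events are overwritten */

enum trace_type { TR_LOCK_ENTER, TR_LOCK_LEAVE };

struct trace_event {
    uint64_t usec;          /* timestamp in microseconds */
    enum trace_type type;
};

struct trace_buf {
    struct trace_event ev[TRACE_SLOTS];
    uint64_t next;          /* total number of events ever recorded */
};

/* Append one event, overwriting the oldest once the ring is full. */
void
trace_record(struct trace_buf *tb, enum trace_type type, uint64_t usec)
{
    struct trace_event *e;

    assert(tb != NULL);
    e = &tb->ev[tb->next % TRACE_SLOTS];
    e->usec = usec;
    e->type = type;
    tb->next++;
}

/* Count enter/leave pairs still in the ring held longer than limit_usec. */
size_t
trace_long_holds(const struct trace_buf *tb, uint64_t limit_usec)
{
    uint64_t i = tb->next > TRACE_SLOTS ? tb->next - TRACE_SLOTS : 0;
    uint64_t entered = 0;
    int held = 0;
    size_t count = 0;

    for (; i < tb->next; i++) {
        const struct trace_event *e = &tb->ev[i % TRACE_SLOTS];

        if (e->type == TR_LOCK_ENTER) {
            entered = e->usec;
            held = 1;
        } else if (held) {
            if (e->usec - entered > limit_usec)
                count++;
            held = 0;
        }
    }
    return (count);
}
```

Recording is a couple of stores into preallocated memory, which keeps the probe cheap enough that it is less likely to disturb the timing it is trying to measure.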
I also worked on improving the APIs available for working on SMP systems in the kernel. As a result of that I have committed an API for doing various atomic operations which originally came from Solaris via NetBSD. I also implemented ticket locks for the big giant lock on sparc64. That, and previous work I'd done on ticket locks for amd64 and i386, should hit the tree after release. I hope at some point to also introduce something to help with reference counting in the kernel, which appears to be an extremely important but often overlooked part of dealing with SMP. The introduction of the atomic API will make doing that a lot easier.
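For readers unfamiliar with ticket locks, the general technique looks roughly like the sketch below. It is written with portable C11 atomics and is not the sparc64, amd64 or i386 kernel code, nor the atomic API mentioned above; `ticket_lock`, `ticket_enter` and `ticket_leave` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;       /* next ticket to hand out */
    atomic_uint serving;    /* ticket currently allowed to hold the lock */
};

void
ticket_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->serving, 0);
}

/* Take a ticket, then spin until that ticket is being served. */
void
ticket_enter(struct ticket_lock *l)
{
    unsigned int me;

    me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;   /* spin; a kernel would add a CPU pause hint here */
}

/* Release the lock by admitting the holder of the next ticket. */
void
ticket_leave(struct ticket_lock *l)
{
    /* The lock must be held: some ticket is outstanding. */
    assert(atomic_load(&l->serving) != atomic_load(&l->next));
    atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}
```

Waiters are admitted strictly in the order they took their tickets, so no CPU can be starved by others repeatedly winning a race for the lock, which a plain test-and-set spinlock does not guarantee.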
My "take home message" from the week in Dunedin is that SMP is extremely hard and requires a huge amount of mental energy compared to any of my previous development work in OpenBSD. Because of where we are in the development cycle only a chunk of the code I wrote for SMP has hit the tree so far, but that's not a bad thing. I committed the bits of myx(4) I was confident in, but after deploying it at work when I got back from Dunedin I discovered two really nasty races in my code, one of which caused an outage of one of the firewalls. Fixing those two simple races took maybe another 3 days of my time to work through. It's hard. Or I haven't had enough experience yet.


Comments

By Bonaventure Soriaux (31.193.133.168) on 2014-02-14 22:14

Fantastic stuff; thanks for the hard work you're putting into the OpenBSD internals and the writeup so we can follow along!
By Anonymous Coward (172.56.39.63) on 2014-02-15 01:13

While I realize that NetBSD and OpenBSD have diverged quite severely over the years, would the fact that NetBSD now has fine grained SMP scalability be of any help? I LOVE OpenBSD and hope that it's scalability improves soon. :)
1. By David Gwynne (2001:388:e000:ba00:754a:aa0e:e99f:c404) david@gwynne.id.au on 2014-02-15 05:42
  
  > While I realize that NetBSD and OpenBSD have diverged quite severely over the years, would the fact that NetBSD now has fine grained SMP scalability be of any help? I LOVE OpenBSD and hope that it's scalability improves soon. :)
  
  we're talking approximately 20 years of divergence. all it takes is for one thing thats been locked by netbsd that we use in another place to ruin everything. given the differences and the effort to verify their locking would still apply, id argue its not worth it and we should do our own code ourselves.
  
  the most useful thing we can get from netbsd, or any other project really, is ideas and patterns and apply them.
