Contributed by tbert on from the only-run-firefox-on-your-web-servers dept.
The only real plan I had made leading up to the hackathon was to do my best to move our SMP support forward. Despite that, I got distracted pretty soon after I turned up because of a discussion with krw@ about leftover work we had after the big restructure of the SCSI midlayer.
3 or 4 years ago I implemented a large change in the SCSI midlayer to provide better utilisation and scheduling of SCSI adapter resources by introducing an abstraction called "iopools". The previous midlayer relied on handling of a special command error code called XS_NO_CCB to provide back pressure to devices it couldn't do work for due to adapter resource shortages. iopools replace the XS_NO_CCB infrastructure with a scheduler that provides fair access to an adapter's command slots.
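To make the idea of fair access to command slots concrete, here is a toy sketch in C. It is not the iopool code from the tree: every name in it (toy_pool, toy_pool_get, toy_pool_put, the slot and queue sizes) is hypothetical. It only illustrates the general pattern of queueing devices in arrival order when no slot is free, instead of failing the command and asking the caller to retry.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NSLOTS  2	/* command slots on the hypothetical adapter */
#define TOY_MAXWAIT 8	/* capacity of the waiter queue */

struct toy_pool {
	int	free_slots;		/* slots nobody is using */
	int	waitq[TOY_MAXWAIT];	/* FIFO ring of waiting device ids */
	size_t	head;			/* index of the longest waiter */
	size_t	count;			/* number of queued waiters */
};

void
toy_pool_init(struct toy_pool *p)
{
	p->free_slots = TOY_NSLOTS;
	p->head = 0;
	p->count = 0;
}

/*
 * Ask for a slot on behalf of device dev.
 * Returns 1 if a slot was granted immediately, 0 if the device was
 * queued behind earlier waiters, and -1 if the queue is full.
 */
int
toy_pool_get(struct toy_pool *p, int dev)
{
	/* Only skip the queue when nobody is already waiting. */
	if (p->count == 0 && p->free_slots > 0) {
		p->free_slots--;
		return (1);
	}
	if (p->count == TOY_MAXWAIT)
		return (-1);
	p->waitq[(p->head + p->count) % TOY_MAXWAIT] = dev;
	p->count++;
	return (0);
}

/*
 * Give a slot back. If a device is waiting, the slot goes straight to
 * the longest waiter and its id is returned; otherwise the slot goes
 * back to the free count and -1 is returned.
 */
int
toy_pool_put(struct toy_pool *p)
{
	int dev;

	if (p->count > 0) {
		dev = p->waitq[p->head];
		p->head = (p->head + 1) % TOY_MAXWAIT;
		p->count--;
		return (dev);
	}
	assert(p->free_slots < TOY_NSLOTS);
	p->free_slots++;
	return (-1);
}
```

The point of handing a returned slot directly to the longest waiter is that a busy device cannot starve a quiet one by winning every retry race, which is the kind of unfairness a retry-on-error scheme invites.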
The intention has always been to go through and replace XS_NO_CCB with iopools, but that requires modification to pretty much every SCSI driver in the tree. Despite iopools being introduced 3 or 4 years ago, there were still a number of drivers that hadn't been cut over, either due to a lack of hardware to test on or due to the complexity of the driver. I spent the first day and a half of the hackathon looking at these leftovers and trying to fix a few.
So I spent some time working on some of the leftover drivers, and took advantage of the hackathon to put the code into the tree early to make it easy for (or force, depending on your perspective) people to try the diffs. These drivers included those for rare hardware like aac, esp, pscp, si, asc, and wdsc, and stupidly complicated drivers like wdc and atapiscsi which required extremely subtle changes to the whole wdc/ata stack. I think I screwed up more than half of them, but I was able to deal with the fallout quickly thanks to being in front of a computer at the hackathon.
After looking at moving ahc(4) and ahd(4) to iopools I quickly remembered that I had other plans and turned to SMP things.
One of the biggest stumbling blocks to further work on SMP in our kernel has been an interaction between the big kernel lock and mutexes. The path forward in that environment was to take entire subsystems and layers out from under the big lock in one go. However, in practice that has obviously not happened in any significant way because some of these layers are quite large (eg, the block layer and network stacks) and need the hardware device drivers interacting with that layer to come with them. All of them. For example, we have approximately 140 network drivers across all our architectures that would have to be made biglock free before we could take the network stack out too. Moving only a portion of these layers out at a time guarantees we will experience deadlocks, which means we can't do it.
kettenis@ has recently implemented a change to mutexes that lets us sidestep those deadlocks, which in turn means we can get on with making evolutionary changes rather than revolutionary changes to improve our SMP scalability. I spent some days taking advantage of that by moving some drivers and layers I am familiar with toward fine grained locking, specifically myx(4), mpi(4), mpii(4), sd(4), the SCSI midlayer represented by scsibus(4), and some of the input side of the network stack.
That code didn't take long to write and it seemed to work, but it was hard to have any confidence in it or judge whether it improved anything or not.
One of the common concerns with moving forward with SMP is introducing performance regressions. If we have to take and give up locks (and the big lock in particular) then we could be adding more places to spin on the locks rather than actually doing more work without the lock held. The other problem with SMP is that it turns the kernel from something kind of like a single process with signal handlers into a heavily threaded program with shared memory and locks. Threaded programs are a lot harder to reason about and observe. It is especially difficult in the kernel because the existing debug infrastructure can affect the timing or locking of the kernel, which can often hide the problem or make it unusably slow.
To better measure and observe what the kernel was doing, I implemented a basic event tracing facility for the kernel. This tracing showed up two interesting things.
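The post does not describe how that tracing facility is built, so the following is only a generic sketch of one common approach: a fixed-size ring of timestamped events that overwrites the oldest entry when full. All names here (trace_ring, trace_record, and so on) and the tiny ring size are assumptions made for illustration. A real kernel version would also need per-CPU buffers or atomic indexing, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACE_NENTRIES 4	/* deliberately tiny for illustration */

struct trace_entry {
	uint64_t when;		/* timestamp supplied by the caller */
	uint32_t event;		/* hypothetical event identifier */
	uint32_t arg;		/* free-form argument */
};

struct trace_ring {
	struct trace_entry e[TRACE_NENTRIES];
	uint64_t next;		/* total number of events ever recorded */
};

void
trace_init(struct trace_ring *r)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
}

/* Record one event, overwriting the oldest entry once the ring is full. */
void
trace_record(struct trace_ring *r, uint64_t when, uint32_t event,
    uint32_t arg)
{
	struct trace_entry *t = &r->e[r->next % TRACE_NENTRIES];

	t->when = when;
	t->event = event;
	t->arg = arg;
	r->next++;
}

/* Number of entries currently held in the ring. */
size_t
trace_count(const struct trace_ring *r)
{
	return (r->next < TRACE_NENTRIES ? (size_t)r->next : TRACE_NENTRIES);
}

/* Oldest entry still in the ring, or NULL if nothing was recorded. */
const struct trace_entry *
trace_oldest(const struct trace_ring *r)
{
	if (r->next == 0)
		return (NULL);
	if (r->next <= TRACE_NENTRIES)
		return (&r->e[0]);
	return (&r->e[r->next % TRACE_NENTRIES]);
}
```

Recording into preallocated memory with a couple of stores keeps the cost of each trace point small, which matters when the thing being observed is timing sensitive.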
Firstly, it showed that my SMP safe code wasn't actually being run without the biglock because the interrupt code on the sparc64 I was working on hadn't been updated to respect the MP safety flag for interrupt handlers. kettenis@ and I quickly resolved that problem and I started testing again.
Secondly, it showed that there were spots where the big kernel lock was being held for "very long" periods of time, which in my world (well, the code) is about 50 microseconds. It turns out this problem is not SMP specific and these pauses occur without my tweaked drivers. The two pauses I found were related to memory management.
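As an illustration of how long lock holds can be picked out of a trace, here is a small sketch in C. The event layout and the function name count_long_holds are hypothetical, and it assumes the events are already sorted by time and come from a single lock. The 50 microsecond threshold is simply passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ev_type { EV_LOCK_ACQUIRE, EV_LOCK_RELEASE };

struct ev {
	uint64_t	usec;	/* timestamp in microseconds */
	enum ev_type	type;
};

/*
 * Count how many acquire/release pairs held the lock for strictly more
 * than limit_usec. A release without a preceding acquire is ignored.
 */
size_t
count_long_holds(const struct ev *evs, size_t n, uint64_t limit_usec)
{
	uint64_t start = 0;
	size_t i, nlong = 0;
	int held = 0;

	assert(evs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (evs[i].type == EV_LOCK_ACQUIRE) {
			start = evs[i].usec;
			held = 1;
		} else if (held) {
			if (evs[i].usec - start > limit_usec)
				nlong++;
			held = 0;
		}
	}
	return (nlong);
}
```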
One of the pauses was generated by the buffer layer and its management of the cache. When memory was available, the cache was supposed to sit between a high and low watermark. When the high watermark was about to be exceeded, it decided to free memory down to the low watermark. If you have a lot of memory in a machine the difference between the high and low watermark can be quite big, and it freed that memory one page at a time. The more memory you have the longer it will take.
I came up with a diff that restricted it to freeing only 8 pages at a time, and then deferring the rest of the frees so they could be interleaved with other work in the kernel. beck@ looked at the problem in a bit more depth and determined that removing the low watermark and simply maintaining the high watermark was a better solution. That change is now in the tree.
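The batching idea can be sketched in a few lines of C. This is not the diff itself, and it is not the change that went into the tree: the structure and function names (toy_cache, toy_cache_trim_step) are made up, and the cache is reduced to a page counter. It only shows the shape of doing a bounded amount of work per call and reporting how much is left for later.

```c
#include <assert.h>
#include <stddef.h>

#define FREE_BATCH 8	/* most pages released in one step */

struct toy_cache {
	size_t pages;	/* pages currently held by the cache */
};

/*
 * Release at most FREE_BATCH pages toward target and return how many
 * pages still need to be released by later calls. The caller is
 * expected to schedule those calls between other pieces of work.
 */
size_t
toy_cache_trim_step(struct toy_cache *c, size_t target)
{
	size_t excess, n;

	assert(c != NULL);

	excess = c->pages > target ? c->pages - target : 0;
	n = excess < FREE_BATCH ? excess : FREE_BATCH;
	c->pages -= n;		/* stand-in for actually freeing n pages */
	return (excess - n);
}
```

Each call holds whatever lock protects the cache for a bounded time regardless of how much memory the machine has, at the cost of taking more calls to reach the target.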
The other pause comes from large processes exiting and having their memory returned to the memory management layer, also one page at a time. There's still ongoing discussion about how to deal with that properly. In the meantime, don't use Firefox on your firewalls.
I also worked on improving the APIs available for working on SMP systems in the kernel. As a result of that I have committed an API for doing various atomic operations which originally came from Solaris via NetBSD. I also implemented ticket locks for the big giant lock on sparc64. That, and previous work I'd done on ticket locks for amd64 and i386, should hit the tree after release. I hope at some point to also introduce something to help with reference counting in the kernel, which appears to be an extremely important but often overlooked part of dealing with SMP. The introduction of the atomic API will make doing that a lot easier.
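For readers who haven't met them, a ticket lock and an atomic reference count can both be sketched with standard C11 atomics. This is a userland illustration only: it uses <stdatomic.h> rather than the kernel atomic API mentioned above, and the names (ticket_lock, toy_ref and their functions) are hypothetical and do not come from the tree.

```c
#include <assert.h>
#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed to hold the lock */
};

void
ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->serving, 0);
}

void
ticket_lock_enter(struct ticket_lock *l)
{
	/* Take a ticket, then spin until it is our turn. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
	    memory_order_relaxed);

	while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
		;	/* spin */
}

void
ticket_lock_leave(struct ticket_lock *l)
{
	/* Only the holder writes serving, so a plain load is enough. */
	unsigned int cur = atomic_load_explicit(&l->serving,
	    memory_order_relaxed);

	atomic_store_explicit(&l->serving, cur + 1, memory_order_release);
}

struct toy_ref {
	atomic_uint refs;
};

void
toy_ref_init(struct toy_ref *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds one reference */
}

void
toy_ref_take(struct toy_ref *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns 1 if it was the last one, otherwise 0. */
int
toy_ref_rele(struct toy_ref *r)
{
	unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
	    memory_order_acq_rel);

	assert(old > 0);	/* catch releasing more than was taken */
	return (old == 1);
}
```

Compared with a simple test-and-set spinlock, the ticket scheme serves waiters in the order they arrived, so no CPU can be starved by others repeatedly winning the race for the lock.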
My "take home message" from the week in Dunedin is that SMP is extremely hard and requires a huge amount of mental energy compared to any of my previous development work in OpenBSD. Because of where we are in the development cycle only a chunk of the code I wrote for SMP has hit the tree so far, but that's not a bad thing. I committed the bits of myx(4) I was confident in, but after deploying it at work when I got back from Dunedin I discovered two really nasty races in my code, one of which caused an outage of one of the firewalls. Fixing those two simple races took maybe another 3 days of my time to work through. It's hard. Or I haven't had enough experience yet.