Wrapping up a week in Coimbra...

Contributed by phessler from the drinking-coffee-in-fibonacci dept.

Right now (Nov 14 - 20, 2012) c2k12 is happening in Coimbra, Portugal and there are 10 thirsty developers working hard on all sorts of projects. Bob Beck (beck@) writes:
I'm on my last day or so of hacking in Coimbra. It's been a very good week here. I've chased down a number of interesting bugs related to the recent commits I have done in the buffer cache area, and found a number of use-after-free cases. I've spent my time chatting with Philip (guenther@openbsd.org) about tough stuff and getting his opinions on things, and have been instrumenting the code to "fast recycle" for debugging rather than the usual "slow" reuse of oldest things first. Doing this has found a few places where we were potentially unsafe. I've also had a lot of fun watching and helping (or at least providing advice from the peanut gallery) with the fun different ways guenther@ is blowing up the kernel by torturing multithreaded processes to fix threading bugs. (I'll let him explain more if he posts here.)

So that may sound like a "breakathon" instead of a "hackathon", but that's been very helpful to me. I'm on the way to finishing up the work to stabilize how the buffer cache behaves in low memory. I'm tired, somewhat overcaffeinated (the coffee here is awesome) but happy.

A big thanks to Pedro Almeida as well as Rodolfo Gouveia and Luis Pinto for setting things up and looking after us so well at the University of Coimbra.
Henning Brauer (henning@) also tells us:
humh, hard to describe without making a loooooooong essay.

Besides the obvious cafes and galaos, I ended up working on the checksums the whole time in Coimbra, again. I wanted to finish the coding part of the new bandwidth shaper - besides the actual code, there is documentation to write and more testing to do. But since naddy@ identified a problem with the checksumming diff after I had committed it, I had to back it out to get that fixed.

The checksum diff I am talking about here is an overhaul on how we deal with the protocol checksums - tcp, udp, icmp, icmp6. The driving force is the handling we have in pf.

The protocol checksums, and foremost this applies to tcp and udp here, aren't just calculated and written out. For now it happens incrementally. The template PCBs (Protocol Control Blocks) have the "static" fields checksummed and carry that checksum in them. Once there is a connection and we have a real PCB (copied from the template) we know the IP addresses, they get added, and the checksum is updated. The checksum here covers only _parts_ of the IP header; it's called the "pseudo header checksum".
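To make the two stages concrete, here is a minimal userland sketch in Python, assuming IPv4 and the standard Internet checksum (16-bit ones' complement sum). The function names are made up for illustration and are not the kernel's; the kernel code is C and is organized differently.

```python
import socket
import struct


def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Add 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # odd-length input is padded with a zero byte
    total = initial + sum(w for (w,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header_sum(src: str, dst: str, proto: int, length: int) -> int:
    """Stage 1: sum only the parts of the IP header the checksum covers."""
    pseudo = (socket.inet_aton(src) + socket.inet_aton(dst)
              + struct.pack("!BBH", 0, proto, length))
    return ones_complement_sum(pseudo)


def full_checksum(src: str, dst: str, proto: int, segment: bytes) -> int:
    """Stage 2: continue over the tcp/udp header and payload, then invert.

    The checksum field inside `segment` must be zero when this is called.
    """
    partial = pseudo_header_sum(src, dst, proto, len(segment))
    return ~ones_complement_sum(segment, partial) & 0xFFFF
```

A receiver that sums the pseudo header and the whole segment, checksum field included, gets 0xFFFF when the packet is intact.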

Way later, namely long after pf, the checksum is again updated to cover the tcp/udp header and the payload.

That means that, in pf, we can see packets with either a full checksum, in the forwarding case, or with just the pseudo header checksum. When pf is doing rewrites to the packets (e.g. NAT), it has to update the checksum, of course. There is no obvious way to see whether we have a full cksum or just a pseudo header checksum. pf used to incrementally update the checksum when touching a header field, and if we touched something not covered by the pseudo header checksum on a packet that had only said pseudo header checksum, we would still "update" the checksum, which in turn breaks it - but not in all cases.
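The failure mode can be shown with a simplified model in Python. It assumes the incremental update from RFC 1624 (HC' = ~(~HC + ~m + m')) and a late pass that sums the whole segment, including whatever already sits in the checksum field, and then inverts. The helper names, the addresses and the exact in-kernel representation of the partial sum are assumptions for illustration only.

```python
import struct


def fold(total: int) -> int:
    """Reduce a sum to 16 bits with end-around carry."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_words(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"
    return fold(sum(w for (w,) in struct.iter_unpack("!H", data)))


def fixup(cksum: int, old: int, new: int) -> int:
    """Incremental update of a finished checksum (RFC 1624, eqn. 3)."""
    return ~fold((~cksum & 0xFFFF) + (~old & 0xFFFF) + new) & 0xFFFF


def udp(sport: int, cksum_field: int, payload: bytes) -> bytes:
    return struct.pack("!HHHH", sport, 53, 8 + len(payload), cksum_field) + payload


def finalize(segment: bytes) -> int:
    """Late pass: sum header and payload on top of the field, then invert."""
    return ~csum_words(segment) & 0xFFFF


PAYLOAD = b"hello"
OLD_PORT, NEW_PORT = 12345, 40000
# Partial sum over 192.0.2.1, 198.51.100.7, protocol 17 and the UDP length.
PSEUDO = csum_words(bytes([192, 0, 2, 1, 198, 51, 100, 7, 0, 17])
                    + struct.pack("!H", 8 + len(PAYLOAD)))


def correct_checksum(sport: int) -> int:
    return finalize(udp(sport, PSEUDO, PAYLOAD))


# Forwarded packet with a full checksum: the incremental update is right.
forwarded = fixup(correct_checksum(OLD_PORT), OLD_PORT, NEW_PORT)

# Local packet with only the partial sum: the same update is applied to a
# field that never covered the port, and the late pass then adds the new
# port on top. The result matches the old packet, not the rewritten one.
patched_field = fixup(PSEUDO, OLD_PORT, NEW_PORT)
local = finalize(udp(NEW_PORT, patched_field, PAYLOAD))
```

In this model `forwarded` equals `correct_checksum(NEW_PORT)`, while `local` equals `correct_checksum(OLD_PORT)` and is therefore wrong for the rewritten packet.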

Now to make that complicated, add hardware checksum offloading. Almost all modern network cards have offload engines that can do these calculations. There is RX checksum offloading, where the card checks the existing checksum and flags the packets as good or bad. For the TX side, there are some that require the pseudo header checksum to be present, and there are some that don't care. And of course all that for udp and tcp, and inet and inet6. And therein lies the reason why we have TX cksum offloading disabled on at least everything that requires the pseudo header checksum to be present: combined with the pf problem mentioned earlier, that breaks.

And now of course for icmp it had to be completely different. There was nothing in our tree for icmp offloading (not sure such hardware exists at all). icmp checksums were handled in a completely different place, and done completely differently. I changed that to work the same way tcp and udp are handled, but we end up in the software engine all the time of course.

This is very very complex, with lots of codepaths and many different cases, and the checksum calculation itself isn't exactly obvious either.

I changed things so that we would not update the checksums in pf at all, but just mark them for checksumming. In most cases that means the offload engine does it; if none is there, I basically emulate one in software. Since the proto cksum covers the entire payload, recalculating it is potentially expensive, since we have to access the entire packet, and not just the headers that'll be in cache at that point anyway (the actual math is cheap). However, some benchmarking mikeb@ did indicates that even with the software emulated engine the speed is pretty much the same as with the old way in pf.
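The "mark now, checksum later" idea can be sketched like this in Python. The `Packet` class, its `needs_cksum` flag and the `output` function are hypothetical stand-ins for illustration, not the kernel's mbuf flags or output path.

```python
import struct
from dataclasses import dataclass


def ones_sum(data: bytes, initial: int = 0) -> int:
    """16-bit ones' complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = initial + sum(w for (w,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


@dataclass
class Packet:
    pseudo: bytes            # pseudo header bytes (addresses, proto, length)
    segment: bytearray       # tcp/udp header plus payload
    cksum_offset: int        # 6 for udp, 16 for tcp
    needs_cksum: bool = False


def rewrite_u16(pkt: Packet, offset: int, value: int) -> None:
    """What a NAT-style rewrite does here: change the field, set the mark."""
    struct.pack_into("!H", pkt.segment, offset, value)
    pkt.needs_cksum = True


def software_engine(pkt: Packet) -> None:
    """Emulated offload engine: recompute over the entire packet."""
    struct.pack_into("!H", pkt.segment, pkt.cksum_offset, 0)
    total = ones_sum(bytes(pkt.segment), ones_sum(pkt.pseudo))
    struct.pack_into("!H", pkt.segment, pkt.cksum_offset, ~total & 0xFFFF)
    pkt.needs_cksum = False


def output(pkt: Packet, hw_offload: bool) -> str:
    """Decide at the very end who calculates the checksum."""
    if not pkt.needs_cksum:
        return "unchanged"
    if hw_offload:
        return "hardware"    # mark stays set, the card fills in the field
    software_engine(pkt)
    return "software"
```

The rewrite itself never touches the checksum field, so it no longer matters whether that field held a full checksum or only the pseudo header part.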

In the committed and backed out version we would potentially fix checksums on forwarded packets arriving with a broken one. That must not happen of course. Fixing that turned out to be way more complicated than I thought: checking the checksum on each and every forwarded packet is really not an option for performance reasons, you need to verify _before_ rewriting, and you don't want to check and/or recalculate twice either. Getting that right in all codepaths is anything but trivial.
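One possible ordering that satisfies those constraints, sketched in Python with hypothetical names (this is not necessarily what was committed): verify only when a rewrite is about to happen, remember the verdict, and refuse to rewrite a packet that arrived broken.

```python
import struct
from dataclasses import dataclass
from typing import Optional


def ones_sum(data: bytes, initial: int = 0) -> int:
    """16-bit ones' complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = initial + sum(w for (w,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


@dataclass
class Forwarded:
    pseudo: bytes
    segment: bytearray
    # None: not checked yet. An RX offload engine could pre-set this.
    verified: Optional[bool] = None
    needs_cksum: bool = False


def checksum_ok(pkt: Forwarded) -> bool:
    """Verify at most once; the verdict describes the packet as it arrived."""
    if pkt.verified is None:
        total = ones_sum(bytes(pkt.segment), ones_sum(pkt.pseudo))
        pkt.verified = total == 0xFFFF
    return pkt.verified


def rewrite_port(pkt: Forwarded, offset: int, port: int) -> bool:
    """Rewrite only packets that arrived intact, then mark for recalculation."""
    if not checksum_ok(pkt):
        return False         # leave it alone, never "repair" a bad packet
    struct.pack_into("!H", pkt.segment, offset, port)
    pkt.needs_cksum = True
    return True
```

Packets that pass through without any rewrite never pay for the verification, and a packet that fails it is left exactly as it arrived for the caller to deal with.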

With a lot of testing help from naddy@ and mikeb@ we identified a few other problem cases as well. I have fixed everything I know of and tested everything I could remotely imagine, so I do have hope it's finally done now - I have been working on this since k2k11 in Iceland in April 2011.

And it makes kittens scared and my head wanting to explode.

Once that is in again and has been shown to be solid, we can go and enable the offload engines in many drivers where they are off now, and have fun with silicon bugs. After some minor cleanup that is possible then, and likely a change to the pseudo header checksumming (an idea from a discussion I had with andre@freebsd during eurobsdcon in Warsaw), I never ever want to be confronted with the proto cksum madness again.

No progress on the bandwidth shaper at all, sadly, due to the dreaded checksums.


  1. By Michael Mitton (mwmitton) mmitton@gmail.com on

    While I know they do this work for their own reasons and gratification, I hope the devs know how much normal users appreciate all their hard work and dedication to OpenBSD. Thanks guys! I look forward to reading more from the hackathon.

