Contributed by jj on from the why-did-it-had-to-be-radixes dept.
I started this year with some nice hiking in New Zealand just before the hackathon. Once I ended up in Dunedin at the University of Otago there were two main things I wanted to do. First of all there was a rather serious bug in the graceful reload handling of bgpd which caused stale routes to remain in the RIB and FIB resulting in bad routing decisions.
I had a fix for this finished before the hackathon so the main goal was to get people to OK it and finally commit it. That was rather easy since henning@ and benno@ joined the day after I arrived and so the diff was committed soon. The second problem I wanted to solve turned out to be more of a bottomless pit filled with old pointy rusty code.Thanks for the report on your adventures.
In Berlin at b2k13 benno@ and florian@ showed me an issue where the interaction between the kernel, ospfd and bgpd caused an internal inconsistency in bgpd's FIB table. I was able to identify the issue fairly quickly thanks to a very detailed bug report from an OpenBSD user. It turned out that the kernel routing table does not properly handle RTM_DELETE in some cases where multiple routes with different priorities are in play. Knowing what the issue was I started to dig into the routing code.
The OpenBSD routing table is based on the original patricia tree implementation that one overly clever person had written. It is amazingly efficient but only for computers and compilers of the early 90s. Over the years a fair amount of extensions where glued onto the base implementation which makes the code even harder to understand. Since I realized that a fair amount of changes were needed to the code I decided to revamp the code a bit and bring it to modern times. It started simple with a few quick changes but then layer after layer came off and I was knee deep in some of the trickiest code bits of our kernel. I started to feel like Puffiana Jones in a maze filled with deadly traps.
I spent most of the week refactoring the code and committing bit by bit in little steps. Never moving too fast to not end up impaled by kernel panics or crushed by some big memory corruption. From time to time I took a break and chatted with dlg@ about some plans to run parts of the network stack without big lock. In the end the hackathon was over way to fast. I was still in the middle of the maze moving slowly; going one step forward at a time and too often 2 steps back. Each time spending a fair amount of time debugging my changes. And way to often I ended up fighting with bad radix nodes or inconsistent radix mask list used in the backtracking path.
Now I'm back home and over the last weeks I think I tracked down and crushed hopefully all the bugs hiding in the code responsible to add routes. In the next weeks I hope to do the same for the delete path and last but not least I should be able to finally fix the actual issue which took me down this crazy journey.
(Comments are closed)