Contributed by marco on from the OpenBGPD-status dept.
OK, I don't live in Texas and I don't have cool friends hacking on ACPI, but that should not diminish this story. This story is about OpenBGPD. BGP -- the Border Gateway Protocol -- is THE routing protocol of the Internet. It is used to exchange routes and compute the routing tables of backbone routers.
Over the last few weeks a lot of development happened in OpenBGPD. Most of it was preparation work for one big missing feature: soft reconfiguration. Soft reconfiguration means that it is possible to reconfigure BGP sessions without restarting them. Restarting a BGP session is normally the last resort, as losing a session means losing part of the connectivity. In OpenBGPD it was already possible to reload the configuration, so you could add new sessions or remove sessions that were no longer needed without interrupting the other BGP sessions. But one thing did not work well: changing parameters or filters. Filters are used to modify or block certain prefixes. Filtering is used to engineer the outgoing traffic, e.g. by preferring cheaper uplinks. The problem was that when the config was reloaded, filter changes were not applied to the prefixes already loaded; only newly added prefixes saw the changes. This resulted in undefined and often unexpected behaviour, so in most cases it was necessary to clear the BGP session after a reload. This is where soft reconfiguration kicks in.
Filters work in two directions and so does soft reconfiguration. Doing softreconfig out is simple: take the old and new filter settings and run over the database. If there is a difference between the output of the old and new filters, an update needs to be sent to the neighbor. OpenBGPD has supported softreconfig out for a few months and it is enabled without any special configuration. Softreconfig in is a different beast. It is not possible to run over the RIB (Routing Information Base) with the new and old filters to find differences, because the incoming filters were already run and we no longer know the original state. So it is necessary to store the original announcements in a separate tree. This is the so-called Adj-RIB-In (there is also an Adj-RIB-Out, but OpenBGPD does not have such a tree; it is calculated on demand). Storing all original announcements in a separate tree drives memory consumption up massively. Zebra and Cisco systems suffer from this problem, so incoming soft reconfiguration is turned off by default on these systems and network admins use it only for very important sessions. OpenBGPD does it differently.
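
For the outgoing direction, the "run both filters over the database and compare" idea can be sketched roughly as follows. This is only an illustration and not bgpd's code: `struct route`, `struct verdict`, `old_filter`, `new_filter` and `softreconfig_out` are all made-up names, and a real filter touches many more attributes than a single localpref.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily simplified route: a prefix plus one attribute. */
struct route {
	const char *prefix;
	int localpref;
};

/* Result of a filter run: denied, or permitted with (maybe modified) attrs. */
struct verdict {
	int permitted;
	int localpref;
};

typedef struct verdict (*filter_fn)(const struct route *);

/* Example "old" ruleset: permit everything unchanged. */
static struct verdict old_filter(const struct route *r)
{
	struct verdict v = { 1, r->localpref };
	return v;
}

/* Example "new" ruleset: block one prefix, raise localpref on another. */
static struct verdict new_filter(const struct route *r)
{
	struct verdict v = { 1, r->localpref };
	if (strcmp(r->prefix, "10.0.0.0/8") == 0)
		v.permitted = 0;
	else if (strcmp(r->prefix, "192.0.2.0/24") == 0)
		v.localpref = 200;
	return v;
}

/*
 * Walk the table once, run both rulesets on every route and report only
 * the differences. Returns the number of messages the neighbor would get.
 */
static int softreconfig_out(const struct route *rib, size_t n,
    filter_fn oldf, filter_fn newf)
{
	int sent = 0;

	assert(rib != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct verdict o = oldf(&rib[i]);
		struct verdict w = newf(&rib[i]);

		if (o.permitted && !w.permitted) {
			printf("withdraw %s\n", rib[i].prefix);
			sent++;
		} else if (!o.permitted && w.permitted) {
			printf("announce %s\n", rib[i].prefix);
			sent++;
		} else if (o.permitted && w.permitted &&
		    o.localpref != w.localpref) {
			printf("update   %s\n", rib[i].prefix);
			sent++;
		}
		/* identical verdicts: the neighbor already has the right view */
	}
	return sent;
}
```

Routes whose old and new verdicts match cost nothing on the wire, which is why the session does not need to be cleared. The same trick does not work inbound, because there the table only holds the already filtered result.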
I spent a long time figuring out how to do it better. Having duplicated information in the RIB is a waste of memory. A busy BGP router with many full feeds (currently 175k routes) has more than a million allocated objects and easily uses 100MB or more (if you are running zebra it is way more). Doing it better was the key behind most of OpenBGPD's development effort, so why should we change a successful process? I soon knew that merging the Adj-RIB-In into the already existing Local-RIB was the way to go. Unmodified prefixes then do not waste additional memory on multiple copies.
Sounds easy but isn't. OpenBGPD's RIB is a maze consisting of several structures that are linked together. There are many dependencies and preconditions. I preferred to make many small steps, extending only part of the RIB at a time. First some cleanup was necessary. Afterwards I modified the attribute storage to work via reference, which reduced memory consumption. Then an additional flag field had to be added to struct prefix. struct prefix is the most allocated structure in bgpd, so the size of the struct is very important. OpenBSD malloc() has fixed bucket sizes. They are powers of 2 (8, 16, 32, 64, ...) or a bunch of pages for large allocations. So growing struct prefix from 32 bytes to 33 bytes would almost double the memory requirements of OpenBGPD, because malloc would use 64-byte buckets instead of 32-byte ones. This was not an option, so a different, less important field had to be removed.
Based on this new flag field it was possible to mark prefixes. This is needed to distinguish entries belonging to the Local-RIB from those belonging to the Adj-RIB-In. A very important precondition in the RDE was that only one prefix per peer and destination may exist. This is no longer true and so many functions have been adapted to the new situation. A couple of commits later this was done. While doing this I found a few additional memory leaks and other minor bugs.
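
To make the point about malloc() bucket sizes concrete, here is a small model of the rounding. It is a sketch under assumptions, not OpenBSD's allocator: `bucket_size` and `MIN_BUCKET` are invented names, the minimum of 8 is simply taken from the sizes listed above, and the page-sized path for large allocations is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest bucket in this model (assumption, from the list 8, 16, 32, ...). */
#define MIN_BUCKET 8

/*
 * Round a small allocation request up to the next power-of-two bucket.
 * Large, page-sized allocations are deliberately left out of this sketch.
 */
static size_t bucket_size(size_t request)
{
	size_t bucket = MIN_BUCKET;

	assert(request > 0);
	while (bucket < request)
		bucket <<= 1;
	return bucket;
}
```

In this model `bucket_size(32)` is 32 while `bucket_size(33)` is 64, so a single extra byte in the most frequently allocated structure costs another 32 bytes for every prefix in the table.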
Finally I modified the update code to merge the information correctly. If the Local-RIB and the Adj-RIB-In have the same info, no additional data needs to be allocated. If the path attributes differ, the modified path needs to be stored and an additional prefix needs to be added. In the end I came up with about a dozen different scenarios. Figuring it out with paper and pencil helped a lot, but the first tests still resulted in strange behaviour and later even in crashes :(. Hey, nobody is perfect, and there is a reason we ship OpenBSD with gdb. By setting a few breakpoints and inspecting some data structures, one bug after the other was hunted down and fixed. Additionally, the malloc option J helped me trigger a bug easily -- while I merged prefixes correctly on updates, I totally forgot about that when removing them later, so prefixes belonging to both RIBs were removed and freed even though one reference was still valid. Doh.
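
The removal bug is easier to see in a stripped-down form. The following is a hedged sketch only: `struct pfx`, the two flag names and both functions are hypothetical and do not mirror bgpd's real struct prefix. It shows the rule the fixed code has to follow: a prefix shared by both RIBs is freed only once neither RIB references it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag bits marking which RIB(s) a prefix belongs to. */
#define PFX_LOCAL_RIB	0x01
#define PFX_ADJ_RIB_IN	0x02

/* Hypothetical stand-in for a prefix entry; only the flag field matters. */
struct pfx {
	unsigned char flags;
};

/* Allocate a prefix that is a member of the given RIB(s). */
static struct pfx *pfx_new(unsigned char flags)
{
	struct pfx *p = calloc(1, sizeof(*p));

	if (p != NULL)
		p->flags = flags;
	return p;
}

/*
 * Drop membership in one RIB. The object is freed only when no RIB
 * references it any more. Returns the prefix if it is still in use,
 * NULL if it was freed.
 */
static struct pfx *pfx_remove(struct pfx *p, unsigned char rib)
{
	assert(p != NULL);
	p->flags &= (unsigned char)~rib;
	if (p->flags & (PFX_LOCAL_RIB | PFX_ADJ_RIB_IN))
		return p;	/* the other RIB still needs it */
	free(p);
	return NULL;
}
```

The buggy behaviour described above corresponds to calling free() unconditionally in the removal path, which leaves the other RIB holding a dangling pointer.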
So that's the story about softreconfig. Now I should finally finish my promised OpenBGPD filter rewrite, but that's a different story.
By panda (193.252.148.11) on
By Anonymous Coward (208.146.43.5) on
This is an excellent example of one of the main reasons I use OpenBSD; the developers' constant commitment to do it right and do it better.
Great work!
By guilherme (201.32.137.252) gmmacedo@terra.com.br on www.inf.ufrgs.br/~gmmacedo
Thanks and keep up the good work.
By Jasper (80.60.145.215) on
By Anonymous Coward (64.235.236.6) on
By daniel (82.131.15.35) on http://septum.org
keep 'em coming i say, and thanks for the stories so far! :)
By shef (212.58.214.69) on
Comments
By Pete (80.203.236.21) on
On the same note, IOS's traceroute also decodes the ICMP MPLS tags as per draft-ietf-mpls-icmp-04.txt; incorporating that at the same time would be awesome.
p.s. I know, I know, I should just shut up 'n' code it...
/Pete
By Anonymous Coward (81.57.42.108) on
"A very important precondition in the RDE was that only one prefix per peer and destination may exist. This is no longer true and so many functions have been adapted to the new situation." -> Does this mean that the "only one prefix per peer" enforcement needed a field in the in-memory RIB structs, and that this field could be removed (freeing up room for a flag discriminating Adj/Local RIB)?
Very interesting post. It shows how careful OpenBGPD's development process is (here, careful about subtle details of memory consumption), and that's what we love about OpenBSD in general!