OpenBSD Journal

Developer blog: claudio

Contributed by marco on from the OpenBGPD-status dept.

Probably you already noticed the mail about the OpenBGPD status I sent to misc@ a day ago. The first reaction I got was marco@ asking me to write something for undeadly. Why not, let's write something for undeadly.

Good, I don't live in Texas and I don't have cool friends hacking on ACPI, but that should not diminish this story. This story is about OpenBGPD. BGP -- the Border Gateway Protocol -- is THE routing protocol of the internet. Backbone routers use it to exchange routes and compute their routing tables.

Over the last few weeks a lot of development happened in OpenBGPD. Most of it was preparation work for one big missing feature: soft reconfiguration. Soft reconfiguration means that it is possible to reconfigure BGP sessions without restarting them. Restarting a BGP session is normally the last resort, as losing a session means losing part of the connectivity. In OpenBGPD it was already possible to reload the configuration, so you could add new sessions or remove no longer needed ones without interrupting the other BGP sessions. But one thing did not work well: changing parameters or filters. Filters are used to modify or block certain prefixes. Filtering is used to engineer the outgoing traffic by preferring e.g. cheaper uplinks. The problem was that after a config reload, filter changes were not applied to the prefixes already loaded; only newly added prefixes saw the changes. This resulted in undefined and often unexpected behaviour, so in most cases it was necessary to clear the BGP session after a reload. This is where soft reconfiguration kicks in.

Filters work in two directions and so does soft reconfiguration. Doing softreconfig out is simple. Take the old and new filter settings and run over the database. If there is a difference between the output of the old and new filters, an update needs to be sent to the neighbor. OpenBGPD has supported softreconfig out for a few months and it is enabled without any special configuration. Softreconfig in is a different beast. It is not possible to run over the RIB (Routing Information Base) with the new and old filters to find differences, because the incoming filters were already run and we no longer know the original state. So it is necessary to store the original announcements in a separate tree. This is the so-called Adj-RIB-In (there is also an Adj-RIB-Out, but OpenBGPD does not keep such a tree; it is calculated on demand). Storing all original announcements in a separate tree drives memory consumption up massively. Zebra and Cisco systems suffer from this problem, so incoming soft reconfiguration is turned off by default on these systems and network admins use it only for very important sessions. OpenBGPD does it differently.
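
To make the softreconfig out idea concrete, here is a minimal sketch in C. It is not bgpd's code: struct route, filter_fn, the example filters and softreconfig_out() are all made-up names, and a route is reduced to a destination plus a single metric. It only shows the comparison described above: run the old and the new filter over every entry and count the ones whose result differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified route: a destination and one attribute. */
struct route {
	unsigned int	dest;	/* stand-in for the prefix */
	int		metric;	/* stand-in for the path attributes */
};

/* A filter blocks a route (returns false) or passes it, possibly modified. */
typedef bool (*filter_fn)(const struct route *in, struct route *out);

/* Example filters, for illustration only. */
static bool
filter_pass(const struct route *in, struct route *out)
{
	*out = *in;
	return true;
}

static bool
filter_set_metric(const struct route *in, struct route *out)
{
	*out = *in;
	out->metric = 50;
	return true;
}

static bool
filter_block_dest_2(const struct route *in, struct route *out)
{
	if (in->dest == 2)
		return false;
	*out = *in;
	return true;
}

/* True if the neighbor would see something different under the new filter. */
static bool
needs_update(const struct route *r, filter_fn oldf, filter_fn newf)
{
	struct route o = *r, n = *r;
	bool o_ok = oldf(r, &o);
	bool n_ok = newf(r, &n);

	if (o_ok != n_ok)
		return true;	/* newly announced or newly withdrawn */
	if (!o_ok)
		return false;	/* blocked before and after */
	return o.dest != n.dest || o.metric != n.metric;
}

/* Run over the whole table and count the updates that would be sent. */
static size_t
softreconfig_out(const struct route *rib, size_t n, filter_fn oldf,
    filter_fn newf)
{
	size_t i, updates = 0;

	assert(rib != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (needs_update(&rib[i], oldf, newf))
			updates++;
	return updates;
}
```

Calling softreconfig_out() with the same filter twice yields no updates, while switching from filter_pass to filter_block_dest_2 yields one update for every route to destination 2. The same trick cannot work for incoming filters, because the table only holds the already filtered result.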

I spent a long time figuring out how to do it better. Having duplicated information in the RIB is a waste of memory. A busy BGP router with many full feeds (currently 175k routes) has more than a million allocated objects and easily uses 100MB or more (if you are running zebra it is way more). Doing it better was the key behind most of OpenBGPD's development effort, so why should we change a successful process? I soon knew that merging the Adj-RIB-In into the already existing Local-RIB was the way to go. Unmodified prefixes will not waste additional memory because of multiple copies. Sounds easy but isn't. OpenBGPD's RIB is a maze consisting of several structures that are linked together. There are many dependencies and preconditions. I preferred to make many small steps, extending only part of the RIB at a time. First some cleanup was necessary. Afterwards I modified the attribute storage to work via reference. This reduced the memory consumption. Then an additional flag field had to be added to struct prefix. struct prefix is the most allocated structure in bgpd and so the size of the struct is very important. OpenBSD malloc() has fixed bucket sizes. They are powers of 2 (8, 16, 32, 64, ...) or a bunch of pages for large allocations. So growing struct prefix from 32 bytes to 33 bytes would almost double the memory requirements of OpenBGPD because malloc would use 64 byte buckets instead of 32 byte ones. This was not an option, so a different, not so important field had to be removed. Based on this new flag field it was possible to mark prefixes. This is needed to distinguish stuff belonging to the Local-RIB from the Adj-RIB-In. A very important precondition in the RDE was that only one prefix per peer and destination may exist. This is no longer true, so many functions had to be adapted to the new situation. A couple of commits later this was done. While doing this I found a few additional memory leaks and other minor bugs.
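
The bucket arithmetic behind the 32 versus 33 byte argument can be sketched in a few lines of C. This is only a model of the rule described above, not the allocator's real code: bucket_size() and MIN_BUCKET are hypothetical names, the smallest bucket is assumed to be 8 bytes as in the list above, and page-sized allocations are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest bucket; an assumption taken from the sizes listed above. */
#define MIN_BUCKET	8

/*
 * Round a request up to the next power-of-two bucket.
 * Hypothetical helper that models the bucket rule only; it does not
 * handle the page-sized case used for large allocations.
 */
static size_t
bucket_size(size_t request)
{
	size_t bucket = MIN_BUCKET;

	assert(request > 0);
	while (bucket < request)
		bucket <<= 1;
	return bucket;
}
```

With this model bucket_size(32) is 32 while bucket_size(33) is 64, so a single extra byte in a structure that is allocated a million times costs 32 bytes per object, not one.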
Finally I modified the update code to correctly merge the information. If the Local-RIB and the Adj-RIB-In have the same info, no additional data needs to be allocated. If the path attributes differ, the modified path needs to be stored and an additional prefix needs to be added. In the end I came up with about a dozen different scenarios. Figuring it out with paper and pencil helped a lot, but the first tests still resulted in strange behaviour and later even in crashes :(. Hey, nobody is perfect, and there is a reason we ship OpenBSD with gdb. By setting a few breakpoints and inspecting some data structures, one bug after the other was hunted down and fixed. Additionally the malloc option J helped me trigger a bug easily -- while I merged prefixes correctly on updates, I totally forgot about that when removing them later, so prefixes belonging to both RIBs were removed and freed even though one reference was still valid. Doh.
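
The merge and the removal bug can be illustrated with a small C sketch. Again this is not the real RDE code: struct pfx, the flag names, pfx_add() and pfx_remove() are invented for the example, and the path attributes are reduced to a single int. The point is that one entry carries a flag per RIB it belongs to, and removal from one RIB may only free the entry once no flag is left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define F_LOCAL		0x01	/* entry belongs to the Local-RIB */
#define F_ORIGINAL	0x02	/* entry belongs to the Adj-RIB-In */

/* Hypothetical stand-in for struct prefix. */
struct pfx {
	unsigned char	flags;
	int		attr;	/* stand-in for the path attributes */
};

/*
 * Store one announcement. If the incoming filter left the attributes
 * untouched, a single entry serves both RIBs; otherwise a second entry
 * holds the modified copy. Returns the number of entries allocated,
 * or 0 if malloc failed.
 */
static int
pfx_add(struct pfx **orig, struct pfx **local, int raw_attr, int filtered_attr)
{
	struct pfx *p, *q;

	if ((p = malloc(sizeof(*p))) == NULL)
		return 0;
	p->attr = raw_attr;
	p->flags = F_ORIGINAL;
	*orig = p;

	if (raw_attr == filtered_attr) {
		p->flags |= F_LOCAL;	/* shared, no second allocation */
		*local = p;
		return 1;
	}
	if ((q = malloc(sizeof(*q))) == NULL) {
		free(p);
		*orig = NULL;
		return 0;
	}
	q->attr = filtered_attr;
	q->flags = F_LOCAL;
	*local = q;
	return 2;
}

/*
 * Remove an entry from one RIB. The memory is released only when the
 * other RIB no longer references it. Returns true if it was freed.
 */
static bool
pfx_remove(struct pfx **slot, unsigned char flag)
{
	struct pfx *p = *slot;

	assert(p != NULL && (p->flags & flag) != 0);
	p->flags &= (unsigned char)~flag;
	*slot = NULL;
	if (p->flags != 0)
		return false;	/* still valid in the other RIB */
	free(p);
	return true;
}
```

The bug described above corresponds to a pfx_remove() that calls free() without checking the remaining flags: the shared entry disappears while the other RIB still points at it.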

So that's the story about softreconfig and now I should finally finish my promised OpenBGPD filter rewrite but that's a different story.


Comments
  1. By panda (193.252.148.11) on

    Thanks for the story behind the latest commits, I'm especially interested in ospfd and bgpd these days so keep posting!

  2. By Anonymous Coward (208.146.43.5) on

    Thanks for the update.
    This is an excellent example of one of the main reasons I use OpenBSD; the developers' constant commitment to do it right and do it better.
    Great work!

  3. By guilherme (201.32.137.252) gmmacedo@terra.com.br on www.inf.ufrgs.br/~gmmacedo

    That's why I like and use OpenBSD and its related projects, because of the transparency in all the development of the projects.
    Thanks and keep up the good work.

  4. By Jasper (80.60.145.215) on

    Nice to see that another developer has a blog here! I hope that more will follow. Regarding OpenBGPD: keep up the good work together with Henning!

  5. By Anonymous Coward (64.235.236.6) on

    using both BGP and OpenBSD daily at work (but not OpenBGPD) makes this blog posting *very* interesting. I can't thank you enough for writing it claudio

  6. By daniel (82.131.15.35) on http://septum.org

    i don't use bgp at all, but these developer stories are still interesting

    keep 'em coming i say, and thanks for the stories so far! :)

  7. By shef (212.58.214.69) on

    it would be very nice if traceroute had AS numbers in the output.

    Comments
    1. By Pete (80.203.236.21) on

      I agree ! I find IOS's traceroute with [AS num] _very_ handy when tracking down problems. Ideally OpenBSD's traceroute would lookup AS num for hops from the fib (or rib if not coupled in to fib) via the new low priv socket i guess.

      On the same note IOS's traceroute also decodes the ICMP MPLS tags as per draft-ietf-mpls-icmp-04.txt , incorporation of that at the same time would be awesome.


      p.s. I know, I know, I should just shut up 'n' code it...

      /Pete

  8. By Anonymous Coward (81.57.42.108) on

    I'm not sure I understand where the byte needed for the new flag was taken from.
    "A very important precondition in the RDE was that only one prefix per peer and destination may exist. This is no longer true and so many functions have been adapted to the new situation." -> does this mean that this "only one prefix per peer" enforcement needed a field in the in-memory rib structs, and that this field could be removed (making room for a flag discriminating Adj/Local RIB) ?

    Very interesting post. We can see how careful OpenBGPD's dev process is (here, careful about subtle details of memory consumption) and that's what we love about OpenBSD in general !
