OpenBSD Journal

Follow Up - PF Internals Redesign

Contributed by merdely on from the pf-got-moar-better dept.

Following the recent Call for Testing story and the n2k8 Part 7 story, Mark Uemura (mtu@) sent this write-up from Ryan McBride (mcbride@) regarding the recent PF state table rearrangement commits (first, second) and the future of PF:

OVERVIEW

These changes complete the split between the layer 3/4 addressing information (state key) and the "extra" tracking information held in the state.

More of Ryan's follow-up below.

More specifically, we deprecate the (often difficult to understand) concept of lan, ext, and gwy addresses, replacing them with WIRE and STACK side address tuples (address family, protocol, source address and port, destination address and port).

This means we have a small amount of duplication in the NAT case, but we save space when no NAT is taking place.

Besides some minor differences in how pfctl(8) and tcpdump(8) print states, these modifications do not alter any of PF's functionality, but rather lay the groundwork for some more interesting changes we are planning.

IMPLEMENTATION NOTES

When we began discussing the implementation details, we encountered some difficulties with terminology: source, destination, and direction are all heavily overloaded terms in pf.c and some people find the lan/ext/gwy address names in the code difficult to interpret.

With these commits, we introduce the concept of WIRE and STACK side address pairs.

The WIRE side addresses are the addresses as seen on the wire. The STACK side addresses are the addresses as seen from the perspective of the kernel internals, and the routing code in particular.

The key structs in question:

struct pf_state_key {
        struct pf_addr   addr[2];
        u_int16_t        port[2];
        sa_family_t      af;
        u_int8_t         proto;
        u_int8_t         pad[2];

        RB_ENTRY(pf_state_key)   entry;
        struct pf_statelisthead  states;
};

struct pf_state {
        ...
        /* addresses stack and wire */
        struct pf_state_key     *key[2];
        ...
}
Each state will have two state keys assigned, one WIRE side, and one STACK side. Each state_key also has a TAILQ of state_items pointing to individual states which use this key, sorted to ensure that interface-bound states are found first.

State keys are always stored in the same address order, such that the source address/port comes first in the ip_input() path. Because of this, the same key will be both the WIRE and STACK side key for a single state in the no-NAT case.

The core logic of this change to understand is pf_find_state() [ and secondarily pf_state_key_setup()/pf_state_key_insert() ]

Inbound:

Packets come off the WIRE, and we search for a state_key matching the correct tuple). We then walk the linked list for a state with the correct kif and a key[PF_SK_WIRE] matching the state_key we've found.

Outbound:

Packets come off the STACK, and we search for a state_key matching the correct tuple, with address and ports swapped. We then walk the linked list for a state with the correct kif and a key[PF_SK_WIRE] matching the state_key we've found.

We check if NAT is in effect by comparing key[PF_SK_WIRE] to key[PF_SK_STACK]; if they're different, we're doing NAT. We then copy the addresses from the correct state key (WIRE on the outbound, STACK on the inbound) into the packet and adjust the checksums.

Sounds simple, but because the state handling is an integral part of PF, the changes required (particularly for NAT) were substantial, and the diff is more than 3500 lines.

WHAT'S NEXT?

The following are some of the things that these commits enable:

  • Some major performance improvements. First, we will link inbound and outbound states, skipping the state_key search in the ip_output() path.
    Once this is working, we can cache the results of some other lookups, skipping them if we've already got the correct information in the state:
    • routing table lookups
    • IPsec
    • TCP/UDP pcbs
  • NAT inbound, RDR outbound. This is particularly interesting for those who want to do redir or nat of traffic to/from the PF box itself; it will also make it easier to NAT networks across IPsec tunnels to deal with conflicting address allocations.
  • The state_item struct will allow an arbitrary number of state lists, which will allow pfsync to delay conversion from pf_state to pfsync_state until the packet is sent. This will make pfsync more efficient and allow us to simplify the code.
  • There are a number of functions that can be cleaned up now that this diff is in.

We will also look more carefully at:

  • For traffic flowing through the firewall, only do state checks (tcp window, paws, etc) once. Keep only one pf_state object.
  • Separate state_key structures for IPv4 and IPv6, saving memory in the IPv4 case and speeding up RB_TREE operations.

And at this point ideas are coming faster than we can implement them, so we'll likely come up some more when we're all together in Edmonton in a few weeks.

Thank you to Mark and Ryan for clarifying the new PF changes.

(Comments are closed)


Comments
  1. By Rich (195.212.199.63) on

    It's good to see that development and tuning of pf is on-going.

    Could I also ask if the documentation will be improved? I refer to the documentation for hooking into pf rather than that for setting up pf rules. Some time ago, I wrote a little prog to allow me to fiddle about with pf tables based on external events. I managed to do it, but the API to do it is (or certainly was back then - a couple of years ago) pretty much non-existent and the code (esp. the header files) is pretty much free of any useful comments, requiring a lot of guess work in using the API; it's difficult to use an API when you have no idea what any of the parameters mean (it's not specific to OBSD of course, but there seems to be an all-too common failure in (esp. unixy) header files to actually describe what stuff actually means and how it should be used). I think I only managed to do it in the end because I managed to find a scrap of code on the web that used some of the API I needed.

    As I say, this was a while ago, and I've not looked to see if things have improved recently.

    Comments
    1. By Anonymous Coward (69.251.202.202) on

      > It's good to see that development and tuning of pf is on-going.
      >
      > Could I also ask if the documentation will be improved? I refer to the documentation for hooking into pf rather than that for setting up pf rules. Some time ago, I wrote a little prog to allow me to fiddle about with pf tables based on external events. I managed to do it, but the API to do it is (or certainly was back then - a couple of years ago) pretty much non-existent and the code (esp. the header files) is pretty much free of any useful comments, requiring a lot of guess work in using the API; it's difficult to use an API when you have no idea what any of the parameters mean (it's not specific to OBSD of course, but there seems to be an all-too common failure in (esp. unixy) header files to actually describe what stuff actually means and how it should be used). I think I only managed to do it in the end because I managed to find a scrap of code on the web that used some of the API I needed.
      >
      > As I say, this was a while ago, and I've not looked to see if things have improved recently.

      Header files are NOT documentation.

      man 4 pf

      If you still don't get it, look at the source to pfctl.

      Comments
      1. By Anonymous Coward (195.212.199.63) on

        > Header files are NOT documentation.
        >
        > man 4 pf
        >
        > If you still don't get it, look at the source to pfctl.

        So header files are not documentation but "source" is? :-)

        As it has been a while, I've just gone back to the latest pf(4) man page (and yes, I have referred to it many times in the past) and just to pick out some examples at random...

        1/ DIOCGETSTATUS (Get the internal packet filter statistics)
        Lots of fields here, not much description of what they actually are.

        2/ DIOCNATLOOK (Look up a state table entry by source and destination addresses and ports)
        Again, lots of fields - no idea what most of them do though.

        3/ DIOCRCLRTABLES (Clear all tables)
        Errr.... shall I go on?

        The pf API is pretty large and certainly pretty complicated and extensive. All I'm saying is that the docs could be better by providing some constructive criticism. You are absolutely right in saying that the header files are not documentation. Unfortunately, the documentation (man page) consists largely of cut & paste lumps of the header file. QED

        Comments
        1. By tedu (207.99.73.226) on


          > 2/ DIOCNATLOOK (Look up a state table entry by source and destination addresses and ports)
          > Again, lots of fields - no idea what most of them do though.

          It's not all spelled out, but I think it's pretty clear what's going on as soon as you think about it.

          "Hmm, what could u_int16_t sport be? pf doesn't play sports! And why's it 16 bits? What kind number could have 16 bits in it? I think port numbers do. So maybe it's an s-port. But what's the s stand for? Wait, what were we looking up again, source and destination ports? Well, destination does have an s in it, but it's not the first letter. So I guess s must stand for source."

          There's even an example at the end for this ioctl that makes it really clear.

          Granted, there could be a little comment for every field, and the stats ioctl is a little ridiculous.

          If you can't figure it out, just ask (in the appropriate place - misc. Complaining on undeadly rarely results in any action). "I'm trying to use DIOCGETSTATUS and I don't know what the bcounters field is for. Can someone explain it?"

          Comments
          1. By Rich (195.212.199.63) on

            Yes, ok, the example you mention is pretty obvious, and there are, no doubt, many other obvious examples, but there are just as many very non-obvious examples and the interaction between this mass of variables is not really clear without spending a long time going through the code (and as has been pointed out, code is not documentation).

            As for complaining - I wasn't! I was just commenting - that's all. As for asking misc, unfortunately I suspect such a question would be met with a load of abuse or (at best, and editing out the obscenities) "read the manual and stop wasting our time" (and we've all seen perfectly reasonable questions like this being shot down on the mailing lists - I just don't bother asking anything on them any more; it's not worth the grief)

            But forget it - it seems that it doesn't matter how reasonable or obvious a point it, many many people will come up with examples to "prove" it's wrong. I don't mind a discussion and a healthy argument and I'm happy to be proved wrong, but I'm bailing out of this conversation before it starts getting abusive (I'm not pointing the finger at you btw), which, sadly, it inevitably will.

            Comments
            1. By Marc Espie (213.41.185.88) espie@openbsd.org on

              As usual: patches welcome. Or real-life examples.

              Developers like pyr@ don't see the documentation as incomplete.

              They don't want to fuck with you, it's just a blind spot, it's pretty hard to write documentation for newbies when you don't know what (to you) obvious clue they're missing.

              Complaining about it like you do is about the worst possible way to try to get things moving.

              Comments
              1. By Anonymous Coward (24.37.242.64) on

                > As usual: patches welcome. Or real-life examples.
                >
                > Developers like pyr@ don't see the documentation as incomplete.
                >
                > They don't want to fuck with you, it's just a blind spot, it's pretty hard to write documentation for newbies when you don't know what (to you) obvious clue they're missing.
                >
                > Complaining about it like you do is about the worst possible way to try to get things moving.
                >
                >

                Nice reply, but just to throw in my thoughts - I didn't read his posting as a 'complaint' at all, just constructive criticism in a public forum and I think that's a good thing.

                I'm sure more people follow undeadly.org than say misc@.

                It's unfortunate others before just jump the gun and start being rude or ignorant about things.

                Comments
                1. By Pierre-Yves Ritschard (81.57.42.68) pyr@spootnik.org on


                  >
                  > It's unfortunate others before just jump the gun and start being rude or ignorant about things.
                  >

                  Wow, I don't see much gun jumping so far, only a few developers trying to say the obvious: pf(4) is not a book, it has all the information you need, but it won't make development a less involved process.

                  Marc is completely right in saying we don't know what you're looking for, maybe you should help us with diffs, but keep in mind these man pages will never be an essay about pf internals, they should just document the interface between userland and kernel.

                  Again, the pf part of relayd (back then called hostated) was written with pf(4) as my only documentation it was my first big software project.

                  I also might add that OpenBSD is the only system where such thorough documentation is available, I took a short trip in ipvs land and you're completely on your own there.

            2. By Anonymous Coward (207.99.73.226) on

              > As for complaining - I wasn't! I was just commenting - that's all. As for asking misc, unfortunately I suspect such a question would be met with a load of abuse or (at best, and editing out the obscenities) "read the manual and stop wasting our time" (and we've all seen perfectly reasonable questions like this being shot down on the mailing lists - I just don't bother asking anything on them any more; it's not worth the grief)

              I think that's an unfair characterization of misc (at least sometimes). Where do you think the example in the man page came from after all? Just search misc for DIOCNATLOOK.

              http://marc.info/?l=openbsd-misc&w=2&r=1&s=DIOCNATLOOK+&q=b

              The last hit, an oldie but a goodie, is dhartmei helping out some totally clueless idiot who couldn't figure out a simple ioctl. :)

              Comments
              1. By Anonymous Coward (165.2.186.10) on

                > > As for complaining - I wasn't! I was just commenting - that's all. As for asking misc, unfortunately I suspect such a question would be met with a load of abuse or (at best, and editing out the obscenities) "read the manual and stop wasting our time" (and we've all seen perfectly reasonable questions like this being shot down on the mailing lists - I just don't bother asking anything on them any more; it's not worth the grief)
                >
                > I think that's an unfair characterization of misc (at least sometimes). Where do you think the example in the man page came from after all? Just search misc for DIOCNATLOOK.
                >
                > http://marc.info/?l=openbsd-misc&w=2&r=1&s=DIOCNATLOOK+&q=b
                >
                > The last hit, an oldie but a goodie, is dhartmei helping out some totally clueless idiot who couldn't figure out a simple ioctl. :)

                Calling the guy a clueless idiot doesn't really help your "unfair characterization" defense.

                Comments
                1. By Pierre Riteau (82.240.212.223) on

                  > > > As for complaining - I wasn't! I was just commenting - that's all. As for asking misc, unfortunately I suspect such a question would be met with a load of abuse or (at best, and editing out the obscenities) "read the manual and stop wasting our time" (and we've all seen perfectly reasonable questions like this being shot down on the mailing lists - I just don't bother asking anything on them any more; it's not worth the grief)
                  > >
                  > > I think that's an unfair characterization of misc (at least sometimes). Where do you think the example in the man page came from after all? Just search misc for DIOCNATLOOK.
                  > >
                  > > http://marc.info/?l=openbsd-misc&w=2&r=1&s=DIOCNATLOOK+&q=b
                  > >
                  > > The last hit, an oldie but a goodie, is dhartmei helping out some totally clueless idiot who couldn't figure out a simple ioctl. :)
                  >
                  > Calling the guy a clueless idiot doesn't really help your "unfair characterization" defense.

                  Look at the IP from where the message you replied to was posted, and then again for the message above which says to ask on misc@.

                  Comments
                  1. By Anonymous Coward (2a01:348:108:155:20a:e4ff:fe2d:99ee) on

                    > Look at the IP from where the message you replied to was posted, and then again for the message above which says to ask on misc@.

                    well, that and look at who sent the question on misc@...

      2. By Rich (195.212.199.63) on

        Sorry - the above post was from me - forgot to add my name

    2. By Henrik Gustafsson (gsson) on http://fnord.se/

      > It's good to see that development and tuning of pf is on-going.
      >
      > Could I also ask if the documentation will be improved?

      What are you talking about? The PF APIs are very well documented, and have been for as long as I've been interested in reading them at least. Apart from that little doc bug back in 2005 which was promptly fixed (as always) when pointed out.

      The pf(4) man-page together with pfctl and pf_ioctl.c for examples is really all you need.

      // gsson

    3. By Pierre-Yves Ritschard (193.252.118.2) pyr@spootnik.org on

      > It's good to see that development and tuning of pf is on-going.
      >
      > Could I also ask if the documentation will be improved? I refer to the documentation for hooking into pf rather than that for setting up pf rules. Some time ago, I wrote a little prog to allow me to fiddle about with pf tables based on external events. I managed to do it, but the API to do it is (or certainly was back then - a couple of years ago) pretty much non-existent and the code (esp. the header files) is pretty much free of any useful comments, requiring a lot of guess work in using the API; it's difficult to use an API when you have no idea what any of the parameters mean (it's not specific to OBSD of course, but there seems to be an all-too common failure in (esp. unixy) header files to actually describe what stuff actually means and how it should be used). I think I only managed to do it in the end because I managed to find a scrap of code on the web that used some of the API I needed.
      >
      > As I say, this was a while ago, and I've not looked to see if things have improved recently.

      It's funny you should mention that because I wrote an application to allow me to fiddle about with pf tables based on external events too. I found all I was looking for in pf(4) ! The application had various names but is called relayd now ;-)

  2. By jared spiegel (70.101.0.7) on

    spurious trailing quote in n2k8 href way above
    ``http://undeadly.org/cgi?action=article&sid=20080529012745%22''

    Comments
    1. By Mike Erdely (merdely) on http://erdelynet.com/

      > spurious trailing quote in n2k8 href way above
      > ``http://undeadly.org/cgi?action=article&sid=20080529012745%22''

      Almost. Missing leading quote. Thanks for letting me know.

      -ME

Latest Articles

Credits

Copyright © - Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]