OpenBSD Journal

Redesign of the pfsync Protocol, Part 3

Contributed by jason on from the mad-scientists-shouldn't-have-this-much-fun dept.

In this installment (see also Part 1 and Part 2), David Gwynne (dlg@) describes the approach and execution of his redesign of the pfsync protocol. He explains the limitations of pfsync version 4 and how he overcame these obstacles.

It was at about this point that it was decided that the code required significant surgery to avoid transmitting too many pfsync packets. Since the code was going to have to be heavily modified to fix its behaviour, slipping an update to the wire protocol was also allowed, especially if it would help mitigate the number of packets pfsync intended to transmit.

Active-Active Firewall Cluster Support in OpenBSD (continued)

Approach and Execution

As described in detail above, the big problem with the current implementation is that it mitigates the sending of pfsync packets too much, i.e., in a firewall cluster with traffic split over two peers, updates aren't exchanged rapidly enough for the states on each firewall to keep up with the actual traffic. This is especially (or perhaps only) problematic with TCP traffic, which requires extremely current information from both sides of the connection to move the TCP sequence numbers forward.

Two attempts were made to solve the active-active problem in pfsync: first, simple modifications to the existing implementation, and then, as a result of that work, an almost full-blown rewrite of the code.

Changes To The pfsync v4 Implementation

A large number of different approaches to dealing with states whose traffic is split over two peers in an active-active firewall cluster were evaluated and tested as part of the initial problem solving. This stage could be considered the exploratory surgery, and was required so I could gain familiarity with the problem and the current implementation. However, all of the solutions except the final one presented here were rejected as unsuitable.

These solutions ranged from allowing split TCP states to accept packets whose sequence numbers are on the edge of the window stored in the state, to decreasing the maximum timeout on pfsync packets from 1 second down to small fractions of a second. All of them either compromised the security provided by pf, or hurt performance too much in existing use cases to be feasible.

Despite the long road to the changes made to the v4 code, the final changes were actually quite minimal. After a lot of trial and error it was decided that each firewall involved in a split path for a TCP session had to be notified of updates to that state as soon as possible. This in turn required the ability to detect when two peers were involved in a split path.

Detecting a split path turned out to be simple. Whenever an update to a state is received via the pfsync protocol, we record the time at which the update arrived in the pf state structure. Then, when an update to the same state arrives via pf, we simply take the difference between the current time and the last time the state was updated via pfsync, and assume the state is split if the difference is less than some arbitrary value, 4 in our case.

If we never get an update about a state via pfsync, it means no other peer was involved in handling that state, so the timestamp in the state will always be 0 (the default value) and the difference between it and the current time will always be greater than 4. A peer that is handling packets for that session will send pfsync packets about it though, so in that case the difference falls below the threshold and we know the state is split.

The other feature of this mechanism is that if the paths in both directions for this state merge onto a single firewall, that firewall will no longer receive updates for the state via pfsync. The timestamp on the state will no longer be updated, and the comparison between it and the system time will eventually fail as time moves forward. This means the same mechanism for detecting split states also allows us to detect when a state is no longer split.
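The check can be pictured with a small standalone sketch. Everything here is illustrative rather than the real kernel code: the structure and names (model_state, last_pfsync_update, SPLIT_THRESHOLD) are invented stand-ins, and the assumption that the threshold is measured in seconds is mine.

/*
 * Standalone model of the split-path check described above; the real
 * code lives in the kernel and works on pf's own state structure.
 */
#include <stdio.h>
#include <time.h>

#define SPLIT_THRESHOLD 4       /* the "arbitrary value" above, assumed seconds */

struct model_state {
        time_t last_pfsync_update;      /* 0 until a peer sends us an update */
};

/* called when pf itself updates the state for a forwarded packet */
static int
state_is_split(const struct model_state *st, time_t now)
{
        /*
         * If no peer has ever updated this state, last_pfsync_update is
         * still 0, the difference is effectively "now" and far above the
         * threshold, so the state is not considered split.
         */
        return (now - st->last_pfsync_update < SPLIT_THRESHOLD);
}

int
main(void)
{
        struct model_state st = { .last_pfsync_update = 0 };
        time_t now = time(NULL);

        printf("untouched by peers: split=%d\n", state_is_split(&st, now));

        st.last_pfsync_update = now - 1;        /* peer update 1 second ago */
        printf("recent peer update: split=%d\n", state_is_split(&st, now));

        return (0);
}

The important property is that a state never touched by a peer keeps its timestamp at 0 and so never looks split, while a state a peer is actively updating always carries a very recent timestamp.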

With that mechanism in place it was trivial to modify the code to immediately send a pfsync state update about our own state changes to the peer.

These changes were successful: if you had a pair of firewalls called A and B between two networks x and y, you could configure the route from hosts on network x to network y to go via firewall A, and the route from hosts on network y to network x to go via firewall B, and then successfully run pf as a fully stateful firewall with traffic for the same session split between those two firewalls.

The problem with these code changes is that they caused pfsync to become extremely chatty. Every single packet involved in a split session going through a firewall would generate a corresponding update from pfsync. Worse, every single update for that split session received from the firewall peers hit the merge-state case in the pfsync update parsing code, which caused us to send an update out again. Because of this, the majority of the tests we did with asymmetric paths through the firewalls showed the pfsync traffic between the two firewalls was several times the volume of the traffic we were actually forwarding. Obviously, generating more load than the traffic we're attempting to filter is unacceptable.

One of the discoveries made during the tinkering with the v4 code was that there was a race between the forwarding of a packet that causes the creation of a state, the pfsync packet it generates, and the moment the reply from the host the packet was destined for is seen by a peer.

If we forward that packet on to the host and that host sends a reply back through another firewall, it is likely that the pfsync packet describing the state, which may be held back for up to 1 second, has not reached that second firewall yet. Because pf is stateful, the second firewall will probably reject or drop that reply rather than forward it on as it should.

In response to this problem a new pfsync message was created, called PFSYNC_ACT_IACK. When a firewall creates a state, instead of forwarding the network packet that created the state immediately, we delay its transmission for a short period. While that first packet is delayed we immediately send a pfsync state insertion message. Peers that receive that insertion message then send an insert acknowledgement message back to the first firewall, which uses it to trigger transmission of the delayed packet. If no firewall is there to acknowledge the insert, a timeout fires and causes the packet to be transmitted anyway.
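The handshake can be sketched as three events acting on one held packet. This is a userland model only; the names (defer_packet, iack_received, defer_timeout, DEFER_TIMEOUT_MS) are hypothetical and not taken from the pfsync source.

/* Model of the insert-acknowledgement (IACK) handshake described above. */
#include <stdbool.h>
#include <stdio.h>

#define DEFER_TIMEOUT_MS 20     /* illustrative; the real timeout is simply "short" */

struct deferred_pkt {
        int     pkt_id;         /* stands in for the held network packet */
        bool    sent;
};

/* a new state was created: hold the packet and announce the insert */
static void
defer_packet(struct deferred_pkt *d, int pkt_id)
{
        d->pkt_id = pkt_id;
        d->sent = false;
        printf("state inserted: holding packet %d, sending insert to peers\n",
            pkt_id);
}

/* a peer acknowledged the insert (PFSYNC_ACT_IACK): release the packet */
static void
iack_received(struct deferred_pkt *d)
{
        if (!d->sent) {
                d->sent = true;
                printf("IACK received: transmitting packet %d\n", d->pkt_id);
        }
}

/* no peer answered in time: the timeout releases the packet anyway */
static void
defer_timeout(struct deferred_pkt *d)
{
        if (!d->sent) {
                d->sent = true;
                printf("timeout: transmitting packet %d anyway\n", d->pkt_id);
        }
}

int
main(void)
{
        struct deferred_pkt d;

        defer_packet(&d, 1);
        iack_received(&d);      /* a peer is present and acknowledges */
        defer_timeout(&d);      /* the later timeout finds the packet already sent */
        return (0);
}

Whichever of the acknowledgement or the timeout happens first releases the packet; the other finds it already sent and does nothing.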

It was at about this point that it was decided that the code required significant surgery to avoid transmitting too many pfsync packets. Since the code was going to have to be heavily modified to fix its behaviour, slipping an update to the wire protocol was also allowed, especially if it would help mitigate the number of packets pfsync intended to transmit.

To summarise, it was determined that not only does the pfsync code mitigate sending of pfsync packets too much, it also doesn't mitigate them enough.

pfsync v5

The only really major flaw in version 4 of the pfsync protocol was its inability to carry multiple types of messages within the same frame. A packet could contain state inserts, or updates, or deletions, and so on, but not a mix of those message types. This becomes a problem if you're trying to mitigate the number of packets sent but need to send a lot of messages of different types.

Therefore the only real change between pfsync v4 and pfsync v5 was the removal of the message type and counter fields from the pfsync header, and the introduction of a sub-header. Several different messages can now be placed in a single pfsync packet, each group prefixed by its own sub-header.

A new message type was added to the protocol as part of the new version too. To cope with the race between forwarding a packet that creates a state and its reply hitting a peer before the state has been received by that peer, it was envisaged that the peer forwarding the first packet would defer sending that packet until it received an acknowledgement of the insert from a peer. That acknowledgement is represented by a new PFSYNC_ACT_IACK message type that simply contains the creator and state ids of the state that was just inserted.

To summarise, pfsync v5 is largely the same as pfsync v4, except for a new message type (IACK) and the ability to send multiple types of messages inside a single pfsync packet due to the addition of a sub-header in the frame format.
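A rough picture of the resulting frame layout: a single pfsync header is followed by one or more groups, each introduced by a sub-header that names the action and counts the messages that follow. The structures below are a sketch to show the shape of the frame; the field names and exact sizes are not copied from the OpenBSD headers.

/* Illustrative layout of a pfsync v5 frame (not the real definitions). */
#include <stdint.h>

struct v5_header {
        uint8_t         version;        /* 5 */
        uint8_t         _pad;
        uint16_t        len;            /* length of the whole payload */
} __attribute__((packed));

struct v5_subheader {
        uint8_t         action;         /* e.g. insert, update, delete, IACK */
        uint8_t         _pad;
        uint16_t        count;          /* number of messages that follow */
} __attribute__((packed));

/* an IACK message only needs to identify the state that was inserted */
struct v5_ins_ack {
        uint64_t        id;             /* state id */
        uint32_t        creatorid;      /* creator id of the inserting host */
} __attribute__((packed));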

The following changes were made to the OpenBSD kernel to address the inadequacies discovered in the previous implementation.

Firstly, the pfsync packet parsing code in the kernel has been split up to avoid using switch statements to pick the code that parses each message type. A switch statement in C has an O(n) cost, where n is the number of cases being switched between. Instead, the parser is picked by using the pfsync action in the sub-header as an index into a table of function pointers. This reduces the cost of picking a parser for a message to O(1).
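A minimal sketch of that dispatch, with invented action names and parser signatures, looks something like this:

/* O(1) parser dispatch: the sub-header action indexes a table of parsers. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { ACT_INS, ACT_UPD, ACT_DEL, ACT_IACK, ACT_MAX };

typedef int (*parse_fn)(const void *buf, int count);

static int parse_ins(const void *buf, int count)  { printf("insert x%d\n", count); return (0); }
static int parse_upd(const void *buf, int count)  { printf("update x%d\n", count); return (0); }
static int parse_del(const void *buf, int count)  { printf("delete x%d\n", count); return (0); }
static int parse_iack(const void *buf, int count) { printf("iack x%d\n", count);   return (0); }

static const parse_fn parsers[ACT_MAX] = {
        [ACT_INS]  = parse_ins,
        [ACT_UPD]  = parse_upd,
        [ACT_DEL]  = parse_del,
        [ACT_IACK] = parse_iack,
};

static int
dispatch(uint8_t action, const void *buf, int count)
{
        if (action >= ACT_MAX || parsers[action] == NULL)
                return (-1);            /* unknown action: reject the packet */
        return (parsers[action](buf, count));
}

int
main(void)
{
        return (dispatch(ACT_UPD, NULL, 3));    /* prints "update x3" */
}

Adding a new message type then only requires another entry in the table, and unknown actions are rejected with a single bounds check.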

Next, pfsync packet generation was moved from being done "on the fly", as pf notified us of updates, to being done only when a packet was ready to be transmitted. This in turn required that the status of a pf state with respect to pfsync be stored somewhere other than in the packet currently being generated.

Previously the code determined if a state was already scheduled to be sent out on the wire by iterating over the packet in memory. This is another O(n) cost where n is the number of states scheduled in the current packet.

Because we also wanted to be able to send multiple types of messages in the same packet, it is now also necessary to mark what type of message a state is associated with. Several queues for holding states were created inside pfsync, one for each type of message a state can appear in on the wire. The pf state structure was extended to record which of these queues it is currently held on. Now it is an O(1) operation to determine where a state is with respect to pfsync.
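The bookkeeping can be sketched roughly as follows; the queue names and the model_state structure are illustrative, not the actual pfsync definitions:

/* One queue per message type; each state remembers which queue it is on. */
#include <sys/queue.h>          /* TAILQ macros, as used throughout the kernel */

enum sync_queue {
        Q_NONE = -1,            /* not scheduled for transmission */
        Q_INS,                  /* state insert messages */
        Q_UPD_C,                /* compressed updates */
        Q_UPD,                  /* full (uncompressed) updates */
        Q_DEL,                  /* deletions */
        Q_MAX
};

struct model_state {
        TAILQ_ENTRY(model_state) entry; /* linkage on at most one queue */
        enum sync_queue          queue; /* which queue we are on, or Q_NONE */
};

TAILQ_HEAD(sync_queue_head, model_state);
static struct sync_queue_head queues[Q_MAX];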

When the packet is eventually scheduled for transmission, the pfsync code walks all these queues and serialises the states into messages within the packet. As it does this iteration it simply marks the pf states as no longer being queued within pfsync.

Another side effect of moving to queues of states is that it is now easy to move states between queues in response to pf notifications or requests from other peers. For example, pf itself could schedule a compressed update for a state, which would leave it on the compressed update queue and marked as such. A peer can then request an uncompressed update for it. Where the previous implementation would have had to send the pending packet out immediately so it could begin a new packet with the uncompressed message type, the new code can trivially figure out that it should simply remove the state from the compressed update queue, place it on the uncompressed update queue, and mark it accordingly.
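Continuing the illustrative structures above, the requeueing amounts to nothing more than unlinking the state from one list and appending it to another; again, the names are invented:

/* Move a state between pfsync message queues without flushing a packet. */
static void
state_requeue(struct model_state *st, enum sync_queue to)
{
        if (st->queue == to)
                return;                         /* already where it should be */
        if (st->queue != Q_NONE)
                TAILQ_REMOVE(&queues[st->queue], st, entry);
        TAILQ_INSERT_TAIL(&queues[to], st, entry);
        st->queue = to;                         /* O(1) bookkeeping */
}

/* peer requested a full update for a state with a compressed update queued: */
/* state_requeue(st, Q_UPD); */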

Next, a mechanism was developed to mitigate having to send "immediate" packets out the instant they are generated. Since there are no general high-resolution timers available in the OpenBSD kernel, it was decided that a new softnet interrupt handler would be created specifically to flush the pfsync message queues into a packet for transmission.

Because all the events in pfsync are generated by code running at softnet, i.e., the pf tests for network packets on both the system's input and output queues and the processing of pfsync packets, it is possible to queue updates for all the states touched during that processing and schedule a softnet interrupt for the pfsync packet generator. Because that code is running at softnet, it masks the pfsync packet generator scheduled at softnet and prevents it from running until after all the current packet processing is finished.

Additionally, on systems with busy network interfaces it is typical to process several dozen packets per call to the softnet interrupt handlers. All the updates requiring immediate transmission of a pfsync packet during that call can therefore be bundled into a single packet before the transmission code runs.
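The effect of deferring the flush can be modelled in a few lines. The function names here are invented, and the real code hooks into the kernel's software-interrupt machinery rather than a boolean flag, but the shape is the same: many updates during one batch of packet processing collapse into a single transmission.

/* Model of deferred flushing: many updates, one pfsync transmission. */
#include <stdbool.h>
#include <stdio.h>

static bool flush_scheduled;

/* called for every state change while packets are being processed */
static void
schedule_flush(void)
{
        flush_scheduled = true; /* in the kernel: schedule the softnet handler */
}

/* runs once the current batch of packet processing is finished */
static void
flush_queues(void)
{
        if (!flush_scheduled)
                return;
        flush_scheduled = false;
        printf("serialising all queued states into one pfsync packet\n");
}

int
main(void)
{
        int i;

        /* several dozen packets handled in one softnet pass... */
        for (i = 0; i < 40; i++)
                schedule_flush();

        /* ...still produce only a single pfsync transmission */
        flush_queues();
        return (0);
}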

Finally, the features from the pfsync v4 reworking were brought over to the new pfsync v5 code. The time at which a pf state was last updated by a pfsync packet is stored in the pf state. If an update from pf for a state occurs within a second of the update from the pfsync system, it is determined that the traffic for that state is now split over two peers in the cluster, and the update is marked for "immediate" transmission by the softnet handler.

The IACK handling made against the pfsync v4 implementation was also brought over; however, instead of the pointer to the packet and its timeout being stored in the state, a separate queue of deferred packets was added to the pfsync code. This was done because the space required for the packet pointer and the timeout was considered excessively wasteful for every state to carry when it would only be used for an extremely short period of time. This was weighed against the extra CPU cost of handling the separate queue, which was considered worth it for the memory savings.
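A sketch of what such a deferral entry might look like, with invented names standing in for the real mbuf and pf state types:

/* Deferred packets get their own small queue entries instead of widening
 * every pf state with fields that are almost never used. */
#include <sys/queue.h>

struct model_pkt;       /* stands in for the held network packet (mbuf) */
struct model_state;     /* stands in for struct pf_state */

struct deferral {
        TAILQ_ENTRY(deferral)    entry;
        struct model_state      *state;         /* state awaiting an IACK */
        struct model_pkt        *pkt;           /* packet held back until then */
        /* plus a timeout that transmits pkt if no IACK arrives */
};

TAILQ_HEAD(, deferral) deferrals = TAILQ_HEAD_INITIALIZER(deferrals);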

Because of the changes to the wire protocol, tools outside the kernel that parse pfsync packets must be updated to understand the new packet format. The only such tool in the OpenBSD source tree is tcpdump, which was updated to handle the new format as part of this work.

Overall the changes to the OpenBSD system resulted in a unified diff to the source tree touching 10 separate files and totaling over 4000 lines of changes.

[ To be continued... ]



Comments
  1. By Anonymous Coward (76.103.56.154) on

    This staggered approach is totally killing the buzz...

  2. By Anonymous Coward (87.144.93.92) on

    How many parts will there be?

  3. By Anonymous Coward (81.227.50.210) on

    Great work! Excellent articles!
    Keep more of this kind coming!!

  4. By Timo Schoeler (2001:1560:2:0:208:2ff:fe8e:361c) timo@riscworks.net on http://riscworks.net

    Hi,

    as a really big fan of OpenBSD and the related projects BUT being confronted with GNU/Linux setups from time to time this is a question that raises up for me.

    AFAICS there's no possibility (at least for UDP traffic like DNS) using pf/pfsync/relayd/carp to create similar setups as with LVS et al. I'm thinking of redundancy as well as load balancing.

    Please correct me, if I'm wrong (I hope to be! :)...

    Comments
    1. By Lennie (194.213.15.37) on

      > Hi,
      >
      > as a really big fan of OpenBSD and the related projects BUT being confronted with GNU/Linux setups from time to time this is a question that raises up for me.
      >
      > AFAICS there's no possibility (at least for UDP traffic like DNS) using pf/pfsync/relayd/carp to create similar setups as with LVS et al. I'm thinking of redundancy as well as load balancing.
      >
      > Please correct me, if I'm wrong (I hope to be! :)...

      Have you looked at relayd ? Look in the undeadly.org archives for Direct Server Return.

      Maybe that's what you are looking for ?

    2. By Xipher (134.161.192.147) on

      > Hi,
      >
      > as a really big fan of OpenBSD and the related projects BUT being confronted with GNU/Linux setups from time to time this is a question that raises up for me.
      >
      > AFAICS there's no possibility (at least for UDP traffic like DNS) using pf/pfsync/relayd/carp to create similar setups as with LVS et al. I'm thinking of redundancy as well as load balancing.
      >
      > Please correct me, if I'm wrong (I hope to be! :)...

      relayd has code specifically for DNS. However, that is the only UDP protocol it supports currently. relayd relies on stateful tracking though, and uses the request IDs in DNS queries to handle state in that case.

      Comments
      1. By Timo Schoeler (2001:1560:2:0:208:2ff:fe8e:361c) timo@riscworks.net on http://riscworks.net

        > > Hi,
        > >
        > > as a really big fan of OpenBSD and the related projects BUT being confronted with GNU/Linux setups from time to time this is a question that raises up for me.
        > >
        > > AFAICS there's no possibility (at least for UDP traffic like DNS) using pf/pfsync/relayd/carp to create similar setups as with LVS et al. I'm thinking of redundancy as well as load balancing.
        > >
        > > Please correct me, if I'm wrong (I hope to be! :)...
        >
        > relayd has code specifically for DNS. However, that is the only UDP protocol it supports currently. relayd relies on stateful tracking though, and uses the request IDs in DNS queries to handle state in that case.

        Yeah, that's the problem I have: I'd like to provide hot failover for custom UDP services (like weird stuff connecting to Port 7001 behind the load balancer, i.e. NATed)...

        Comments
        1. By Anonymous Coward (98.127.110.254) on

          > Yeah, that's the problem I have: I'd like to provide hot failover for custom UDP services (like weird stuff connecting to Port 7001 behind the load balancer, i.e. NATed)...

          You might look at the source-hash option for rdr in pf.conf(5).
