OpenBSD Journal

Redesign of the pfsync Protocol, Part 4

Contributed by jason on from the wrapping-it-up-with-a-big-bow dept.

In the final installment of this series (see also Part 1, Part 2 and Part 3), David Gwynne (dlg@) presents the performance impact of his pfsync changes.

The new code results in fewer pfsync messages being exchanged between pf firewalls than the old code in identical setups. Users with active-passive setups benefit from this reduction, and may also gain back a small amount of CPU time due to the relative efficiency of the new implementation.

Active-Active Firewall Cluster Support in OpenBSD (continued)

Results

During the implementation of the new version of the pfsync protocol, several problems in the OpenBSD kernel were uncovered.

Insufficient splsoftnet() Protection In The Network Stack

The pfsync code assumed that all paths into it from pf would hold splsoftnet, an assumption necessary to guarantee that the pfsync data structures were sufficiently locked.

Testing of the code during development kept showing corrupt data structures in pfsync and in the mbuf (the network packet handling structures in the OpenBSD kernel) pools. These corruptions inevitably led to panics in the OpenBSD kernel. Because of this corruption, calls to splassert (a function that checks whether the current CPU interrupt mask is at least as high as the level required) were added to the entry points into the pfsync driver to verify that softnet was actually held.
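
For example, such a guard might look like the following minimal sketch (the function name here is hypothetical, not the actual pfsync code):

    #include <sys/param.h>
    #include <sys/systm.h>

    /* hypothetical pfsync entry point guarded by splassert(9) */
    void
    pfsync_example_update(void)
    {
            /* complain if the interrupt priority level is below
             * IPL_SOFTNET, i.e. softnet is not actually blocked */
            splassert(IPL_SOFTNET);

            /* ... safe to touch the pfsync data structures here ... */
    }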

It was discovered that there were cases when pfsync was being called without softnet being held.

When a normal network interface is brought up, i.e., configured to send and receive packets, the IPv6 protocol is also enabled on that interface and immediately generates IPv6 duplicate address detection packets. These packets are built and sent down the network stack without the spl being raised.

This meant that large portions of the network stack, including pf and pfsync, were being used without appropriate protection. If the system received a packet during this time, it was likely that the state in these subsystems would become corrupted. This was indeed the case discovered with pfsync.

As a result of this discovery, several code paths in the IPv6 stack had the necessary splsoftnet() calls added to provide the protection required by the network stack.
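
The pattern added is the standard spl idiom found throughout the network stack; roughly (a simplified sketch, not the exact committed diff):

    int s;

    s = splsoftnet();       /* block softnet processing */
    /* ... build and send the duplicate address detection packet;
     * pf and pfsync may now be entered safely ... */
    splx(s);                /* restore the previous interrupt level */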

This fix was developed by Ryan McBride and committed to the OpenBSD source tree. Further splasserts have also been added to other parts of the network stack to try to catch any further problems.

ip_output() Panic

After pfsync generates a packet to be transmitted, it hands it to the ip_output() function to be sent down the stack and onto the wire. A combination of three factors caused some of these packets to generate a panic.

Firstly, pfsync generates packets with the DF (don't fragment) flag set in the IP header. This means that the network stack should not break the packet up into multiple IP fragments if it is too large to transmit.
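
For reference, marking an outgoing packet this way is a one-liner when the IP header is built; roughly (a simplified sketch):

    struct ip *ip = mtod(m, struct ip *);   /* m is the outgoing mbuf */

    ip->ip_off = htons(IP_DF);              /* never fragment pfsync packets */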

Secondly, pfsync sends to a multicast address by default. This changes how ip_output behaves internally in several ways; most relevant here is that packets sent to multicast addresses do not necessarily result in a route lookup.

Lastly, due to an accounting error, the pfsync code would generate network packets that were too large to be transmitted. When ip_output is asked to deal with a packet that is larger than the interface's MTU and has the DF flag set, it takes that opportunity to check whether the route to the destination address needs to be updated with a smaller MTU.

Because the pfsync packet was being sent to a multicast address, the local variable inside ip_output holding the destination's route was not set. When ip_output tried to use that variable to update the route's MTU, it generated a system trap caused by an access to an invalid memory address.

The fix for this problem was developed by Claudio Jeker: a simple check that the route variable was not NULL was added before any attempt to use or modify the route. This fix was also committed to the OpenBSD source tree.
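
In outline, the fix is just a guard of this form (variable names assumed here for illustration; the actual diff is in the OpenBSD source tree):

    /* only touch the route if one was actually looked up;
     * for multicast destinations there may be none */
    if (ro->ro_rt != NULL) {
            /* ... update the route's MTU ... */
    }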

Functional Results

Unfortunately, the time taken to implement the new protocol and its handling in the OpenBSD kernel, and then to debug it all, left little time for testing and evaluation of performance. Despite this, the initial results are much better than expected.

The new code results in fewer pfsync messages being exchanged between pf firewalls than the old code in identical setups. Users with active-passive setups benefit from this reduction, and may also gain back a small amount of CPU time due to the relative efficiency of the new implementation.

Even better, the new code makes active-active firewalls actually work.

This code was developed with the idea that async paths through pf firewalls were the worst case for a firewall cluster, and that the code modifications to support them were simply to make that case tolerable, rather than unusable as it is with the current code. Because of this it was predicted that traffic forwarded over async paths would move at a rate noticeably slower than the same traffic going over a single firewall.

This was indeed the case with the simple modifications to the pfsync v4 code base: there was always a significant slowdown for traffic over async paths compared to the same traffic sent over a single firewall.

However, it appears that the new pfsync protocol and implementation scale a lot better. Relatively slow TCP connections (less than 10 thousand packets per second) do not experience any slowdown when split across async paths. At this rate it is trivial for the pfsync traffic to keep up with the rate at which the TCP session window moves forward. As the TCP packet rate increases above that threshold, the pfsync updates begin to struggle to keep each firewall's view of the sequence numbers in sync. As a result the TCP state matching code in pf begins to drop packets that have moved beyond what it thinks the window should be.
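
To put that threshold in perspective: assuming full-sized 1460-byte TCP segments, 10 thousand packets per second works out to roughly 14.6MB/s of payload, or on the order of 120Mbit/s on the wire.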

This result is in line with the expectation stated above. However, compared to the behaviour of the old implementation where async traffic simply stalls, this is a massive improvement.

Experience has shown that the majority of connections through a firewall tend toward lots of relatively slow streams rather than one massively fast stream. From that point of view it is possible that the worst-case behaviour of the new code will not be noticeable in practice.

The characteristics of the new protocol are also heavily dependent on the behaviour of the hardware it runs on. The quality of the interrupt mitigation and the choice of network cards have a heavy influence on how many packets are processed at softnet. It is unfortunate that more time was not available to gain some understanding of these interactions.

Conclusion

The new protocol and the rewritten pfsync kernel code are a success. Not only do they allow active-active firewall clusters to be built, they also improve performance for the currently supported active-passive configurations by reducing the network load of the associated pfsync traffic.

With this code it is now possible to increase the throughput between two networks by adding firewalls, rather than having to scale up the performance of a single active firewall that takes all the load. Each peer in such an active-active cluster will be able to act as an independent gateway from the point of view of the client systems, but the network administrator will still have the ability to apply policy with the pf firewall and will not have to give up security for the performance gained by running multiple gateways. Effort should be spent attempting to engineer the network so that both the send and receive paths travel over the one firewall, but if that engineering fails it is possible that the service will degrade rather than fail.

Future Work

Despite the improvements, the code is still relatively immature and is not currently considered a complete replacement for the previous pfsync implementation. Testing between firewalls based on different CPU architectures is required to ensure no endianness or alignment issues exist in the code base. Previously supported features, such as IPsec security association syncing, also need to be tested to ensure the new version of the protocol and implementation supports them.

There are also some hard-to-fix issues with the new implementation when moving from sync traffic paths to async paths. The TCP state merging code in the state update input path seems to reject valid information, which can leave a peer without the most recent information required to forward already established connections.

Working through those issues should be a relatively trivial task given the right hardware, but was impossible to do in the time available.

It is also unlikely that this code will make it into the OpenBSD 4.5 release for these same reasons, but it is almost a certainty that it will be integrated into the 4.6 release. Work with other OpenBSD developers is continuing to ensure the code is reliable enough for inclusion in the source tree.

David adds the following addendum regarding changes since his paper was released:

Two features relating to async paths need to be fixed before the code is usable in active-active firewall setups. Firstly, the merging of pfsync state updates with the local pf state needs to be rewritten. This is complicated by the synproxy functionality in pf, which abuses the TCP state fields; that case has to be handled correctly before the code can be integrated. Secondly, the current implementation does not handle deferrals of initial packets correctly. Deferred packets appear to be sent only via the timeout on those packets; the peers in a cluster do not generate the IACK messages necessary to have the packet sent sooner. This leads to a noticeable delay for new sessions created through a firewall, which is unacceptable in practice. Working through these issues should be a relatively trivial task given more time and a better test environment.
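
To illustrate the intended deferral mechanism (a rough sketch with hypothetical names, not the actual pfsync code), the initial packet of a new state is parked and then released by whichever comes first, the peer's IACK or a worst-case timer:

    /* hypothetical deferral record for the held initial packet */
    struct deferral {
            struct timeout   d_tmo;
            struct mbuf     *d_mbuf;
    };

    void deferral_timeout(void *); /* sends d_mbuf when the timer fires */

    void
    defer_packet(struct deferral *d)
    {
            /* park the packet and arm a worst-case safety timer */
            timeout_set(&d->d_tmo, deferral_timeout, d);
            timeout_add(&d->d_tmo, hz / 50);        /* roughly 20ms */
    }

    /* when the peer's IACK arrives: timeout_del(&d->d_tmo) and
     * transmit d->d_mbuf immediately instead of waiting */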

We'd like to thank David Gwynne (dlg@) for his fine work on pfsync and for allowing us to reprint his paper for our readers. Please show your appreciation by donating to the OpenBSD project so that we can continue to enjoy the rewards of projects like this.



Comments
  1. By Anonymous Coward (84.245.24.117) on

    It's good stuff to read and I could follow the ideas of the design.
    In part 3, I think, you give us a look into the OpenBSD kernel, with the locks and IPLs. Is there a general guide to how the OpenBSD kernel operates internally?

    Comments
    1. By Anonymous Coward (98.127.110.254) on

      > It's good stuff to read and I could follow the ideas of the design.
      > In part 3, I think, you give us a look into the OpenBSD kernel, with the locks and IPLs. Is there a general guide to how the OpenBSD kernel operates internally?

      The Design and Implementation of the 4.4BSD Operating System

      Comments
      1. By Anonymous Coward (2a01:348:108:100:230:18ff:fea0:6af6) on

        > > It's good stuff to read and I could follow the ideas of the design.
        > > In part 3, I think, you give us a look into the OpenBSD kernel, with the locks and IPLs. Is there a general guide to how the OpenBSD kernel operates internally?
        >
        > The Design and Implementation of the 4.4BSD Operating System

        Also (from the networking side) TCP/IP Illustrated Vol.2 can be useful. Yes, they're old and things have changed, so you need to make good use of cvs logs, diffs, reading the code and most importantly using your brain, but these books give good commentary.

  2. By Anonymous Coward (70.81.15.127) on

    I didn't get a chance to read the article yet, but was wondering, how will this work with DHCP'd environments? Will we still be able to use the old method if needed?

    As it stands now, I get my IP's from DHCP (my MAC address is also registered). I use CARP on my NAT Routers/GWs with some simple tricks - not sure if an article is worth while here?

    It works great since the inception of CARP in OpenBSD, so hopefully this won't be gone if static IPs will be a strict requirement on the external side? Unless of course we could do similar tricks with CARP in the code itself for the possibility of native DHCP support in CARP...?

    Curious to hear of future goals as well... =)

    Regards,

  3. By Anonymous Coward (194.78.205.247) on

    Thanks a bunch for these!

    Quick question:
    >Each peer in such an active-active cluster will be able to act as an independent gateway from the point of view of the client systems

    The word "independent" confuses me here. Does this mean I have to configure my hosts to use Peer A or Peer B? In which case won't the hosts using Peer A be screwed if it goes down? Or am I misinterpreting?

    Comments
    1. By Anonymous Coward (2a01:348:108:155:216:41ff:fe53:6a45) on

      > Thanks a bunch for these!
      >
      > Quick question:
      > >Each peer in such an active-active cluster will be able to act as an independent gateway from the point of view of the client systems
      >
      > The word "independent" confuses me here. Does this mean I have to configure my hosts to use Peer A or Peer B? In which case won't the hosts using Peer A be screwed if it goes down? Or am I misinterpreting?

      You could use two independent carp addresses (A being primary and B backup for one, A backup and B primary for the other).
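
      A minimal sketch of that idea (interface names, vhids, addresses and the password are illustrative):

        # on router A: master for 10.0.0.1, backup for 10.0.0.2
        ifconfig carp0 create
        ifconfig carp0 vhid 1 pass secret carpdev em0 advskew 0 10.0.0.1 netmask 255.255.255.0
        ifconfig carp1 create
        ifconfig carp1 vhid 2 pass secret carpdev em0 advskew 100 10.0.0.2 netmask 255.255.255.0
        # router B swaps the advskew values, becoming master for
        # 10.0.0.2 and backup for 10.0.0.1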

      Alternatives (which would result in more connections being split between routers) would be to use carp loadbalancing, or run OSPF on hosts and announce default from both routers into that (with hosts configured to use ECMP).

  4. By Alexander Nasonov (204.4.130.140) alnsn@yandex.ru on

    First of all, thanks for the great work!

    Description of active-active configuration implies a pair of firewalls. Is it possible to configure more than two active firewalls, e.g. active-active-active? If the answer is yes, what would happen when one active node becomes unavailable?

    Thanks,
    Alex

    Comments
    1. By Anonymous Coward (81.165.178.114) on

      > Description of active-active configuration implies a pair of firewalls. Is it possible to configure more than two active firewalls, e.g. active-active-active? If the answer is yes, what would happen when one active node becomes unavailable?


      I think active-active with 2 firewalls is a bad idea because then you should only use 50% of each firewall anyway. So when one firewall fails the other can take over the other 50%. If you use more than 50% on each in a 2-node active-active setup, you actually create 2 points of failure. No failover for you then. Correct?

      Comments
      1. By Anonymous Coward (194.78.205.247) on

        > I think active-active with 2 firewalls is a bad idea because then you should only use 50% of each firewall anyway. So when one firewall fails the other can take over the other 50%. If you use more than 50% on each in a 2-node active-active setup, you actually create 2 points of failure. No failover for you then. Correct?
        >

        That depends on your situation, surely? Degraded performance might be acceptable in the event of one machine blowing up.

        Comments
        1. By jason (jason) on http://www.dixongroup.net/

          Active-active can also be useful in situations where you want a subset of your traffic to only go across one of the firewalls (or a subset of the firewall cluster) but want the states synchronized so that this can be reverted later on.

          The example I like to use is something we'll be putting into production soon (expect a write-up) where high traffic events are routed across an alternate set of firewalls to alleviate pressure on the primary firewall(s). This can be accomplished with BGP announcements on the front-end and MAC-based return-path routing by the destination (in this case, a load-balancer).

          Comments
          1. By swank (216.18.67.164) on

            > (expect a write-up)

            please?

      2. By Anonymous Coward (83.227.8.240) on

        > > Description of active-active configuration implies a pair of firewalls. Is it possible to configure more than two active firewalls, e.g. active-active-active? If the answer is yes, what would happen when one active node becomes unavailable?
        >
        >
        > I think active-active with 2 firewalls is a bad idea because then you should only use 50% of each firewall anyway. So when one firewall fails the other can take over the other 50%. If you use more than 50% on each in a 2-node active-active setup, you actually create 2 points of failure. No failover for you then. Correct?
        >

        Monitor load trends carefully and add a third firewall when the load goes over your comfort level.

  5. By Erik Carlseen (68.107.78.192) on

    Very interesting work! I'm curious as to what sort of hardware the testing was done on, and if you have any idea whether the limitations you were hitting at 10K packets/second were due to CPU overhead, network bandwidth on the links used to synchronize the states, or other issues (bus saturation, etc).

    Comments
    1. By Anonymous Coward (66.39.160.90) on

      > Very interesting work! I'm curious as to what sort of hardware the testing was done on, and if you have any idea whether the limitations you were hitting at 10K packets/second were due to CPU overhead, network bandwidth on the links used to synchronize the states, or other issues (bus saturation, etc).
      >

      Sounds like the pfsync protocol itself isn't efficient enough to pass enough pfsync window update messages in his test setup when the TCP flow went beyond 10000pps. Holy wah eh?

      Comments
      1. By Erik Carlseen (68.107.78.192) on

        > > Very interesting work! I'm curious as to what sort of hardware the testing was done on, and if you have any idea whether the limitations you were hitting at 10K packets/second were due to CPU overhead, network bandwidth on the links used to synchronize the states, or other issues (bus saturation, etc).
        > >
        >
        > Sounds like the pfsync protocol itself isn't efficient enough to pass enough pfsync window update messages in his test setup when the TCP flow went beyond 10000pps. Holy wah eh?

        That's a theoretical peak of around 120Mbps, assuming a 1500-byte MTU and all of your packets being full. There are situations where that would be considered a major bottleneck. I've been looking for ways to insert OpenBSD into some of these projects, and high-speed active-active support might kick some customers over the edge... They have a difficult time wrapping their heads around FOSS.
