OpenBSD Journal

PF: Firewall Ruleset Optimization

Contributed by dhartmei from the eleven-releases-with-pf dept.

This is the first installment in a series of three articles about PF. I originally wrote them as chapters for a book, but then publication was cancelled. Luckily, the rights could be salvaged, and now you get to enjoy them as undeadly.org exclusives. In celebration of the upcoming OpenBSD 4.0 release. ;)
  • Firewall Ruleset Optimization
    • Goals
    • The significance of packet rate
    • When pf is the bottleneck
    • Filter statefully
    • The downside of stateful filtering
    • Ruleset evaluation
    • Ordering rulesets to maximize skip steps
    • Use tables for address lists
    • Use quick to abort ruleset evaluation when rules match
    • Anchors with conditional evaluation
    • Let pfctl do the work for you
  • Testing Your Firewall (read)
  • Firewall Management (read)

Firewall Ruleset Optimization

Goals

Ideally, the operation of a packet filter should not affect legitimate network traffic. Packets violating the filtering policy should be blocked, and compliant packets should pass the device as if the device wasn't there at all.

In reality, several factors limit how well a packet filter can achieve that goal. Packets have to pass through the device, adding some amount of latency between the time a packet is received and the time it is forwarded. Any device can only process a finite number of packets per second. When packets arrive at a higher rate than the device can forward them, packets are lost.

Most protocols, like TCP, deal well with added latency. You can achieve high TCP transfer rates even over links that have several hundred milliseconds of latency. On the other hand, in interactive network gaming even a few tens of milliseconds are usually perceived as too much. Packet loss is generally a worse problem: TCP performance deteriorates seriously when a significant number of packets are lost.

This article explains how to identify when pf is becoming the limiting factor in network throughput and what can be done to improve performance in this case.

The significance of packet rate

One commonly used unit to compare network performance is throughput in bytes per second. But this unit is completely inadequate for measuring pf performance. The real limiting factor isn't throughput but packet rate, that is, the number of packets per second the host can process. The same host that handles 100Mbps of 1500-byte packets without breaking a sweat can be brought to its knees by a mere 10Mbps of 40-byte packets. The former amounts to only 8,000 packets/second, but the latter traffic stream amounts to 32,000 packets/second, which causes roughly four times the amount of work for the host.
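As a quick sanity check (ignoring framing overhead), the shell can reproduce these packet rates by dividing the link speed in bits per second by the packet size in bits; the figures in the text above are simply rounded:

  $ echo $((100000000 / (1500 * 8)))
  8333
  $ echo $((10000000 / (40 * 8)))
  31250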

To understand this, let's look at how packets actually pass through the host. Packets are received from the wire by the network interface card (NIC) and read into a small memory buffer on the NIC. When that buffer is full, the NIC triggers a hardware interrupt, causing the NIC driver to copy the packets into network memory buffers (mbufs) in kernel memory. The packets are then passed through the TCP/IP stack in the form of these mbufs. Once a packet is transferred into an mbuf, most operations the TCP/IP stack performs on the packet are not dependent on the packet size, as these operations only inspect the packet headers and not the payload. This is also true for pf, which gets passed one packet at a time and makes the decision of whether to block it or pass it on. If the packet needs forwarding, the TCP/IP stack will pass it to a NIC driver, which will extract the packet from the mbuf and put it back onto the wire.

Most of these operations have a comparatively high cost per packet, but a very low cost per size of the packet. Hence, processing a large packet is only slightly more expensive than processing a small packet.

Some limits are based on hardware and software outside of pf. For instance, i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints. Some network interface cards will generate one interrupt for each packet received. Hence, the host will start to lose packets when the packet rate exceeds around 10,000 packets per second. Other NICs, like more expensive gigabit cards, have larger built-in memory buffers that allow them to bundle several packets into one interrupt. Hence, the choice of hardware can impose some limits that no optimization of pf can surpass.

When pf is the bottleneck

The kernel passes packets to pf sequentially, one after the other. While pf is being called to decide the fate of one packet, the flow of packets through the kernel is briefly suspended. During that short period of time, further packets read off the wire by the NICs have to fit into memory buffers. If pf evaluations take too long, packets will quickly fill up the buffers, and further packets will be lost. The goal of optimizing pf rulesets is to reduce the amount of time pf spends on each packet.

An interesting exercise is to intentionally push the host into this overloaded state by loading a very large ruleset like this:

  $ i=0; while [ $i -lt 100 ]; do \
      printf "block from any to %d.%d.%d.%d\n" \
        `jot -r -s " " 4 1 255`; \
      let i=i+1; \
    done | pfctl -vf -

  block drop inet from any to 151.153.227.25
  block drop inet from any to 54.186.19.95
  block drop inet from any to 165.143.57.178
  ...
This represents a worst-case ruleset that defies all automatic optimizations. Because each rule contains a different random non-matching address, pf is forced to traverse the entire ruleset and evaluate each rule for every packet. Loading a ruleset that solely consists of thousands of such rules (increase the loop count above accordingly), and then generating a steady flow of packets that must be filtered, inflicts noticeable load on even the fastest machine. While the host is under load, check the interrupt rate with:

  $ vmstat -i
And watch CPU states with:

  $ top
This will give you an idea of how the host reacts to overloading, and will help you spot similar symptoms when using your own ruleset. You can use the same tools to verify effects of optimizations later on.

Then try the other extreme. Completely disable pf with:

  $ pfctl -d
Then compare the vmstat and top values.

This is a simple way to get a rough estimate and upper limit on what to realistically expect from optimization. If your host handles your traffic with pf disabled, you can aim to achieve similar performance with pf enabled. However, if the host already shows problems handling the traffic with pf disabled, optimizing pf rulesets is probably pointless, and other components should be changed first.

If you already have a working ruleset and are wondering whether you should spend time on optimizing it for speed, repeat this test with your ruleset and compare the results with both extreme cases. If running your ruleset shows effects of overloading, you can use the guidelines below to reduce those effects.

In some cases, the ruleset shows no significant amount of load on the host, yet connections through the host show unexpected problems, like delays during connection establishment, stalling connections or disappointingly low throughput. In most of these cases, the problem is not filtering performance at all, but a misconfiguration of the ruleset which causes packets to get dropped. See Testing Your Firewall for how to identify and deal with such problems.

And finally, if your ruleset is evaluated without causing significant load and everything works as expected, the most reasonable conclusion is to leave the ruleset as it is. Often, rulesets written in a straightforward way without regard for performance are evaluated efficiently enough to cause no packet loss. Manual optimizations will only make the ruleset harder to read for the human maintainer, while having only insignificant effect on performance.

Filter statefully

The amount of work done by pf mainly consists of two kinds of operations: ruleset evaluations and state table lookups.

For every packet, pf first does a state table lookup. If a matching state entry is found in the state table, the packet is immediately passed. Otherwise pf evaluates the filter ruleset to find the last matching rule for the packet which decides whether to block or pass it. If the rule passes the packet, it can optionally create a state entry using the 'keep state' option.

When filtering statelessly, without using 'keep state' to create state entries for connections, every packet causes an evaluation of the ruleset, and ruleset evaluation is the single most costly operation pf performs in this scenario. Each packet still causes a state table lookup, but since the table is empty, the cost of the lookup is basically zero.

Filtering statefully means using 'keep state' in filter rules, so packets matching those rules will create a state table entry. Further packets related to the same connections will match the state table entries and get passed automatically, without evaluations of the ruleset. In this scenario, only the first packet of each connection causes a ruleset evaluation, and subsequent packets only cause a state lookup.

Now, a state lookup is much cheaper than a ruleset evaluation. A ruleset is basically a list of rules which must be evaluated from top to bottom. The cost increases with every rule in the list: twice as many rules mean twice the amount of work, and evaluating a single rule can require comparison of numerous values in the packet. The state table, on the other hand, is a tree. The cost of a lookup increases only logarithmically with the number of entries: twice as many states mean only one additional comparison, a fraction of additional work. And a comparison is needed only for a limited number of values in the packet.

There is some cost to creating and removing state entries. But assuming the state entry will match several subsequent packets and save a ruleset evaluation for each of them, the sum is much cheaper. For specific connections like DNS lookups, where each connection only consists of two packets (one request and one reply), the overhead of state creation might be worse than two ruleset evaluations. Connections that consist of more than a handful of packets, like most TCP connections, will benefit from the created state entry.
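If you do want to handle such two-packet exchanges statelessly, both directions have to be passed explicitly. A minimal sketch (assuming a placeholder macro $ext_if and a default block policy, not taken from the article's examples) could look like this:

  pass out on $ext_if proto udp from any to any port domain
  pass in  on $ext_if proto udp from any port domain to any

Whether this is actually cheaper than keeping state depends on the size of the ruleset; for most setups, simply keeping state everywhere is the simpler choice.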

In short, you can make ruleset evaluation a per-connection cost instead of a per-packet cost. This can easily make a factor of 100 or more. For example, I see the following counters when I run:

  $ pfctl -si

  State Table                          Total             Rate
    searches                       172507978          887.4/s
    inserts                          1099936            5.7/s
    removals                         1099897            5.7/s
  Counters
    match                            6786911           34.9/s
This means pf gets called about 900 times per second. I'm filtering on multiple interfaces, so that would mean I'm forwarding about 450 packets per second, each of which gets filtered twice, once on each interface it passes through. But ruleset evaluation occurs only about 35 times per second, and state insertions and deletions only 6 times per second. With anything but a tiny ruleset, this is very well worth it.

To make sure that you're really creating state for each connection, search for 'pass' rules which don't use 'keep state', like in:

  $ pfctl -sr | grep pass | grep -v 'keep state'
Make sure you have a tight 'block by default' policy; otherwise packets might pass not because they matched an explicit 'pass' rule, but because they mismatched all rules and passed by default.
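Putting both recommendations together, a minimal stateful skeleton with a tight block-by-default policy might look like the following sketch ($ext_if, $int_net and $web_srv are placeholder macros, not part of the article's examples):

  block all
  pass out on $ext_if from $int_net to any keep state
  pass in  on $ext_if proto tcp from any to $web_srv port www keep state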

The downside of stateful filtering

The only downside to stateful filtering is that state table entries need memory, around 256 bytes for each entry. When pf fails to allocate memory for a new state entry, it blocks the packet that should have created the state entry instead, and increases an out-of-memory counter shown by:

  $ pfctl -si
  Counters
    memory                                 0            0.0/s
Memory for state entries is allocated from the kernel memory pool called 'pfstatepl'. You can use vmstat(8) to show various aspects of pool memory usage:

  $ vmstat -m
  Memory resource pool statistics
  Name        Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
  pfstatepl    256  1105099    0  1105062   183   114    69   127     0 625   62
The difference between 'Requests' and 'Releases' equals the number of currently allocated state table entries, which should match the counter shown by:

  $ pfctl -si
  State Table                          Total             Rate
    current entries                       36
Other counters shown by pfctl can get reset by pfctl -Fi.

Not all memory of the host is available to the kernel, and the way the amount of physical RAM affects the amount available to the kernel depends on architecture and kernel options and version. As of OpenBSD 3.6, an i386 kernel can use up to 256MB of memory. Prior to 3.6, that limit was much lower for i386. You could have 8GB of RAM in your host, and still pf would fail to allocate memory beyond a small fraction of that amount.

To make matters worse, when pf really hits the limit where pool_get(9) fails, the failure is not as graceful as one might wish. Instead, the entire system becomes unstable after that point, and eventually crashes. This really isn't pf's fault, but a general problem with kernel pool memory management.

To address this, pf itself limits the number of state entries it will allocate at the same time, using pool_sethardlimit(9), also shown by vmstat -m output. The default for this limit is 10,000 entries, which is safe for any host. The limit can be printed with:

  $ pfctl -sm
  states     hard limit  10000
  src-nodes  hard limit  10000
  frags      hard limit    500
If you need more concurrent state entries, you can increase the limit in pf.conf with:

  set limit states 10000
The problem is determining a large value that is still safe enough not to trigger a pool allocation failure. This is still a sore topic, as there is no simple formula to calculate the value. Basically, you have to increase the limit and verify the host remains stable after reaching that limit, by artificially creating many entries.

On the bright side, if you have 512MB or more of RAM, you can now use 256MB for the kernel, which should be safe for at least 500,000 state entries. And most people consider that a lot of concurrent connections. Just imagine each of those connections generating just one packet every ten seconds, and you end up with a packet rate of 50,000 packets/s.
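The arithmetic behind that estimate is simple: at roughly 256 bytes per state entry, 500,000 entries occupy on the order of 122MB, which fits comfortably into a 256MB kernel:

  $ echo $((500000 * 256 / 1024 / 1024))
  122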

More likely, you don't expect that many states at all. But whatever your state limit is, there are cases where it will be reached, like during a denial-of-service attack. Remember, pf will fail closed, not open, when state creation fails. An attacker could create state entries until the limit is reached, just for the purpose of denying service to legitimate users.

There are several ways to deal with this problem.

You can limit the number of states created from specific rules, for instance like:

  pass in proto tcp from any to $ext_if port www keep state (max 256)
This would limit the number of concurrent connections to the web server to 256, while other rules could still create state entries. Similarly, the maximum number of connections per source address can be restricted with:

  pass keep state (source-track rule, max-src-states 16)
Once a state entry is created, various timeouts define when it is removed. For instance:

  $ pfctl -st
  tcp.opening                  30s
The timeout for TCP states that are not yet fully established is set to 30 seconds. These timeouts can be lowered to remove state entries more aggressively. Individual timeout values can be set globally in pf.conf:

  set timeout tcp.opening 20
They can also be set in individual rules, applying only to states created by these rules:

  pass keep state (tcp.opening 10)
There are several pre-defined sets of global timeouts which can be selected in pf.conf:

  set optimization aggressive
Also, there are adaptive timeouts, which means the timeouts are not constants, but variables which adjust to the number of state entries allocated. For instance:

  set timeout { adaptive.start 6000, adaptive.end 12000 }
pf will use constant timeout values as long as there are less than 6,000 state entries. When there are between 6,000 and 12,000 entries, all timeout values are linearly scaled from 100% at 6,000 to 0% at 12,000 entries, i.e. with 9,000 entries all timeout values are reduced to 50%.

In summary, you can probably specify the maximum number of states you expect to support. Set this as the limit for pf. Expect the limit to get reached during certain attacks, and define a timeout strategy for this case. In the worst case, pf will drop packets when state insertion fails, and the out-of-memory counter will increase.
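As a concrete sketch of such a strategy (the numbers and macros here are illustrative assumptions, not recommendations from the article), the relevant pf.conf lines might look like:

  set limit states 100000
  set timeout { adaptive.start 60000, adaptive.end 120000 }
  pass in on $ext_if proto tcp to $web_srv port www keep state (source-track rule, max-src-states 64)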

Ruleset evaluation

A ruleset is a linear list of individual rules, which are evaluated from top to bottom for a given packet. Each rule either does or does not match the packet, depending on the criteria in the rule and the corresponding values in the packet.

Therefore, to a first approximation, the cost of ruleset evaluation grows with the number of rules in the ruleset. This is not precisely true for reasons we'll get into soon, but the general concept is correct. A ruleset with 10,000 rules will almost certainly cause much more load on your host than one with just 100 rules. The most obvious optimization is to reduce the number of rules.

Ordering rulesets to maximize skip steps

The first reason why ruleset evaluation can be cheaper than evaluating each individual rule in the ruleset is called skip steps. This is a transparent and automatic optimization done by pf when the ruleset is loaded. It's best explained with an example. Imagine you have the following simple ruleset:

  1. block in all
  2. pass in on fxp0 proto tcp from any to 10.1.2.3 port 22 keep state
  3. pass in on fxp0 proto tcp from any to 10.1.2.3 port 25 keep state
  4. pass in on fxp0 proto tcp from any to 10.1.2.3 port 80 keep state
  5. pass in on fxp0 proto tcp from any to 10.2.3.4 port 80 keep state
A TCP packet arrives in on fxp0 to destination address 10.2.3.4 on some port.

pf will start the ruleset evaluation for this packet with the first rule, which fully matches. Evaluation continues with the second rule, which matches the criteria 'in', 'on fxp0', 'proto tcp', 'from any' but doesn't match 'to 10.1.2.3'. So the rule does not match, and evaluation should continue with the third rule.

But pf is aware that the third and fourth rule also specify the same criterion 'to 10.1.2.3' which caused the second rule to mismatch. Hence, it is absolutely certain that the third and fourth rule cannot possibly match this packet either, and pf immediately jumps to the fifth rule, saving several comparisons.

Imagine the packet under inspection was UDP instead of TCP. The first rule would have matched, evaluation would have continued with the second rule. There, the criterion 'proto tcp' would have made the rule mismatch the packet. Since the subsequent rules also specify the same criterion 'proto tcp' which was found to mismatch the packet, all of them could be safely skipped, without affecting the outcome of the evaluation.

Here's how pf analyzes your ruleset when you load it. Each rule can contain a list of criteria like 'to 10.1.2.3', restricting the rule to match packets with that destination address. For each criterion in each rule, pf counts the number of rules immediately below that rule which specify the exact same criterion. This can be zero, when the next rule does not use the exact same criterion. Or it can be any number up to the number of remaining rules, when they all specify it. The counted numbers are stored in memory for later use. They're called skip steps because they tell pf how many subsequent steps (rules) can be skipped when any criterion in any rule is found not to match the packet being inspected.

Rule evaluation compares the criteria in the rule against the values in the packet in a fixed order:

  1. interface ('on fxp0')
  2. direction ('in', 'out')
  3. address family ('inet' or 'inet6')
  4. protocol ('proto tcp')
  5. source address ('from 10.1.2.3')
  6. source port ('from port < 1024')
  7. destination address ('to 10.2.3.4')
  8. destination port ('to port 80')
If the rule completely matches, evaluation continues on the very next rule. If the rule does not match, the first criterion from the list above which mismatches decides which skip step is used. There might be more than one criterion which mismatches, but only the first one, in the order of the list above, matters.

Obviously, the order of rules in your ruleset affects the skip step values calculated for each rule. For instance:

  1. pass on fxp0
  2. pass on fxp1
  3. pass on fxp0
  4. pass on fxp1
This ruleset will produce skip steps with value zero for the interface criterion in each rule, because no adjacent rules contain the same interface criterion.

Those rules could instead be ordered like:

  1. pass on fxp0
  2. pass on fxp0
  3. pass on fxp1
  4. pass on fxp1
The skip step value for the interface criterion would then equal one in the first and third rule.

This makes a small difference when the ruleset is evaluated for a packet on fxp2. Before the reordering, all four rules are evaluated because none of them can be skipped. After the reordering, only rules one and three need to be evaluated, and rules two and four can be skipped.

The difference may be insignificant in this little example, but imagine a ruleset containing 1,000 rules which all apply only to two different interfaces. If you order these rules so all rules applying to one interface are adjacent, followed by the rules applying to the other interface, pf can reliably skip 500 rules in each and every evaluation of the ruleset, reducing the cost of ruleset evaluation to 50%, no matter what packets your traffic consists of.

Hence, you can help pf maximize its skip steps by ordering your rules by the criteria in the order they are listed above, i.e. order your rules by interface first. Within the block of rules for the same interface, order rules by direction. Within the block for the same interface and direction, order by address family, etc.
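For example, a small ruleset grouped this way might be laid out as follows (the interfaces and addresses are only illustrative; only the grouping matters):

  # fxp0, inbound
  pass in on fxp0 proto tcp from any to 10.1.2.3 port 22 keep state
  pass in on fxp0 proto tcp from any to 10.1.2.3 port 80 keep state
  pass in on fxp0 proto udp from any to 10.1.2.3 port 53 keep state
  # fxp0, outbound
  pass out on fxp0 proto tcp from 10.1.2.3 to any keep state
  # fxp1, inbound
  pass in on fxp1 proto tcp from any to 10.2.3.4 port 80 keep state
  # fxp1, outbound
  pass out on fxp1 proto udp from 10.2.3.4 to any keep state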

To verify the effects, run

  $ pfctl -gsr
pfctl prints the calculated skip step values for each criterion in each rule, for instance

  @18 block return-rst in quick on kue0 proto tcp from any to any port = 1433
  [ Skip steps: i=38 d=38 f=41 p=27 sa=48 sp=end da=43 ]
In this output, 'i' stands for interface, 'd' for direction, 'f' for address family, etc. The 'i=38' part means that packets which don't match 'on kue0' will skip to rule number 38.

This also affects the number of evaluations counted for each rule; try:

  $ pfctl -vsr
pfctl counts how many times each rule has been evaluated, how many packets and bytes it matched and how many states it created. When a rule is skipped by skip steps during evaluation, its evaluation counter is not increased.
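The per-rule counters in the verbose output look roughly like this (the rule and numbers below are made up purely for illustration):

  pass in on fxp0 proto tcp from any to 10.1.2.3 port = www keep state
    [ Evaluations: 451230   Packets: 1238      Bytes: 679102    States: 7     ]

A rule whose Evaluations counter grows much faster than its Packets and States counters is one that is checked often but rarely matches.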

Use tables for address lists

The use of lists in curly braces allows you to write very compact rules in pf.conf, like:

  pass proto tcp to { 10.1.2.3, 10.2.3.4 } port { ssh, www }
But these lists are not actually loaded into a single rule in the kernel. Instead, pfctl expands the single input rule to multiple rules for the kernel, in this case

  $ echo "pass proto tcp to { 10.1.2.3, 10.2.3.4 } port { ssh, www }" |
	pfctl -nvf -
  pass inet proto tcp from any to 10.1.2.3 port = ssh keep state
  pass inet proto tcp from any to 10.1.2.3 port = www keep state
  pass inet proto tcp from any to 10.2.3.4 port = ssh keep state
  pass inet proto tcp from any to 10.2.3.4 port = www keep state
The short syntax in pf.conf belies the real cost of evaluating it. Your pf.conf might be only a dozen rules long, but if those expand to hundreds of rules in the kernel, the evaluation cost is the same as if you had put those hundreds of rules into pf.conf in the first place. To see what rules are really being evaluated, check:

  $ pfctl -sr
For one specific type of list, a list of addresses, there is a container in the kernel called a 'table'. For example:

  pass in from { 10.1.2.3, 10.2.3.4, 10.3.4.5 }
The list of addresses can be expressed as a table:

  table <clients> const { 10.1.2.3, 10.2.3.4, 10.3.4.5 }
  pass in from <clients>
This construct can be loaded as a single rule (and a table) into the kernel, whereas the non-table version would expand to three rules.

During evaluation of the rule referencing the table, pf will do a lookup of the packet's source address in the table to determine whether the rule matches the packet. This lookup is very cheap, and the cost does not increase with the number of entries in the table.

If the list of addresses is large, the performance gain of one rule evaluation with one table lookup vs. one rule evaluation for each address is significant. As a rule of thumb, tables are cheaper when the list contains six or more addresses. For a list of 1,000 addresses, the difference will be a factor of 1,000.
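A further advantage is that table contents can be changed at runtime with pfctl, without reloading the ruleset, as long as the table is not declared const (for instance, declared as table <clients> persist { ... } instead of the const version above). A sketch:

  $ pfctl -t clients -T add 10.4.5.6
  $ pfctl -t clients -T delete 10.2.3.4
  $ pfctl -t clients -T show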

Use quick to abort ruleset evaluation when rules match

When a rule does match, pf (unlike other packet filtering products) does not by default abort ruleset evaluation, but continues until all rules have been evaluated. When the end is reached, the last rule that matched (the last-matching rule) makes the decision.

The option 'quick' can be used in rules to make them abort ruleset evaluation when they match. When 'quick' is used on every single rule, pf's behaviour effectively becomes first-matching, but that's not the default.

For instance, pf filters packets passing through any interface, including virtual interfaces such as loopback. If, like most people, you don't intend to filter loopback traffic, a line like the following at the top can save a lot of rule evaluations:

  set skip on { lo0 }
The ruleset might contain hundreds of rules all mismatching the loopback interface, and loopback traffic might just pass by the implicit default pass. The difference is between evaluating these hundreds of rules for every loopback packet, or not evaluating them at all.

Usually, you'd place a rule with 'quick' at the top of the ruleset, reasoning that it has the potential of matching and saving the evaluation of the rules further down. But in those cases where the rule does not match a packet, its placement at the top has caused one more evaluation. In short, the frequency with which a rule is expected to match is also relevant when deciding its placement within the ruleset for performance reasons. And the frequency with which it does match depends on your actual traffic.

Instead of guessing how likely a rule should match on average, you can use the rule evaluation and matching counters that are printed by:

  $ pfctl -vsr
When you see a rule near the top that is evaluated a lot but rarely matches, you can move it further down in the ruleset.
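As a sketch of this kind of tuning (the table and macro names here are hypothetical), a frequently matching rule is placed near the top with 'quick', while a rule that is evaluated often but rarely matches moves further down:

  block in quick on $ext_if from <abusive_hosts>
  pass in quick on $ext_if proto tcp to $web_srv port www keep state
  # ... many rules that are only evaluated when none of the above matched ...
  pass in on $ext_if proto tcp to $admin_host port ssh keep state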

Anchors with conditional evaluation

An anchor is basically a ruleset separate from the main ruleset, or a sub-ruleset. You can load entire rulesets into anchors, and cause them to get evaluated from the main ruleset.

Another way to look at them is to compare filtering rules with a programming language. Without anchors, all your code is in a single main function, the main ruleset. Anchors, then, are just subroutines, code in separate functions that you can call from the main function.

As of OpenBSD 3.6, you can also nest anchors within anchors, building a hierarchy of subroutines, and call one subroutine from another. In OpenBSD 3.5 and before, the hierarchy could only be one level deep, that is, you could have multiple subroutines, but could call subroutines only from the main ruleset.

For instance:

  pass in proto tcp from 10.1.2.3 to 10.2.3.4 port www
  pass in proto udp from 10.1.2.3 to 10.2.3.4
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port ssh
  pass in proto udp from 10.1.2.4 to 10.2.3.5
  pass in proto tcp from 10.1.2.5 to 10.2.3.6 port www
  pass in proto udp from 10.1.2.5 to 10.2.3.6
  pass in proto tcp from 10.1.2.6 to 10.2.3.7 port www
You could split the ruleset into two sub-rulesets, one for UDP called "udp-only":

  pass in proto udp from 10.1.2.3 to 10.2.3.4
  pass in proto udp from 10.1.2.4 to 10.2.3.5
  pass in proto udp from 10.1.2.5 to 10.2.3.6
And a second one for TCP called "tcp-only":

  pass in proto tcp from 10.1.2.3 to 10.2.3.4 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port www
  pass in proto tcp from 10.1.2.4 to 10.2.3.5 port ssh
  pass in proto tcp from 10.1.2.5 to 10.2.3.6 port www
  pass in proto tcp from 10.1.2.6 to 10.2.3.7 port www
Both of them can be called from the main ruleset with:

  anchor udp-only
  anchor tcp-only
That would not improve performance much, though. Actually, there is some overhead involved when the kernel has to step into and out of these sub-rulesets.

But anchor calls can also contain filter criteria, much like pass/block rules:

  anchor udp-only in on fxp0 inet proto udp
  anchor tcp-only in on fxp0 inet proto tcp
The sub-ruleset is only evaluated for packets that match the criteria. In other words, the subroutine is only conditionally evaluated. When the criteria do not match, the call is skipped, and the evaluation cost is limited to the comparison of the criteria in the call.

For performance, this is mainly relevant when the sub-ruleset contains many rules, and the call criteria are not those primarily optimized by skip steps.
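Anchors also make maintenance easier: a sub-ruleset can be loaded and inspected independently of the main ruleset with pfctl's -a option (the file name here is hypothetical):

  $ pfctl -a udp-only -f /etc/pf-udp.conf
  $ pfctl -a udp-only -sr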

Let pfctl do the work for you

As of OpenBSD 3.6, several of the optimizations discussed can be automated by pfctl -o. The optimizer analyzes a ruleset and makes modifications that do not change the effect of the ruleset.

First, pfctl splits the ruleset into blocks of adjacent rules in such a way that reordering rules within one block cannot possibly affect the outcome of evaluation for any packet.

For example, the rules in the following block can be arbitrarily reordered:

  pass proto tcp to 10.1.2.3 port www keep state
  pass proto udp to 10.1.2.3 port domain keep state
  pass proto tcp to 10.1.0.0/16 keep state
But in most cases rule order is relevant. For instance:

  block log all
  block from 10.1.2.3
  pass from any to 10.2.3.4
Changing the position of any of those rules produces completely different effects. After swapping the first two rules, packets from 10.1.2.3 still get blocked, but they're now also logged. Exchange the last two rules, and packets from 10.1.2.3 to 10.2.3.4 are suddenly blocked. And switching the first and last rule blocks every packet.

In every case of possible dependency, pfctl splits the rules into separate blocks. In the worst case, when no two adjacent rules can be freely reordered, each rule becomes a separate block containing only that rule, and pfctl can't make any modifications.

Otherwise, pfctl sorts the rules in each block so that skip step values are maximized:

  $ cat example
  pass proto tcp from 10.0.0.3 to 10.0.0.8
  pass proto udp from 10.0.0.1
  pass proto tcp from 10.0.0.2
  pass proto tcp from 10.0.0.4
  pass proto udp from 10.0.0.6
  pass proto tcp from 10.0.0.3 to 10.0.0.7

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.3 to 10.0.0.8
  pass inet proto tcp from 10.0.0.3 to 10.0.0.7
  pass inet proto tcp from 10.0.0.2 to any
  pass inet proto tcp from 10.0.0.4 to any
  pass inet proto udp from 10.0.0.1 to any
  pass inet proto udp from 10.0.0.6 to any
When duplicate rules are found, they are removed:

  $ cat example
  pass proto tcp from 10.0.0.1
  pass proto udp from 10.0.0.2
  pass proto tcp from 10.0.0.1

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.1 to any
  pass inet proto udp from 10.0.0.2 to any
Redundant rules are removed as well:

  $ cat example
  pass proto tcp from 10.1/16
  pass proto tcp from 10.1.2.3
  pass proto tcp from 10/8

  $ pfctl -onvf example
  pass inet proto tcp from 10.0.0.0/8 to any
Multiple rules are combined into a single rule using a table where possible and advantageous:

  $ cat example
  pass from 10.1.2.3
  pass from 10.2.3.4
  pass from 10.3.4.5
  pass from 10.4.5.6
  pass from 10.5.6.7
  pass from 10.8.9.1

  $ pfctl -onvf example
  table <__automatic_0> const { 10.1.2.3 10.2.3.4 10.3.4.5 10.4.5.6
                                10.5.6.7 10.8.9.1 }
  pass inet from <__automatic_0> to any
When called with -oo, pfctl also consults the evaluation counters shown by pfctl -vsr to reorder 'quick' rules according to matching frequency.

The optimizer is very conservative in making changes, performing only those that are certain not to affect the outcome of ruleset evaluation under any circumstances for any packet. This has the advantage that the optimizer can be used safely with any ruleset. The drawback is that pfctl might not dare to change something which you could, if you thought about the effect of the change. Like the skip step optimization, the performance improvement depends on how large the blocks of reorderable rules are. By manually reordering rules first, you can potentially improve the gain the optimizer can produce.

The easiest way to see what -o or -oo does with your ruleset is to compare its suggestion with the original ruleset, like this:

  $ pfctl -nvf /etc/pf.conf >before
  $ pfctl -oonvf /etc/pf.conf >after
  $ diff -u before after
When run against a manually optimized ruleset, the differences are usually unspectacular. Significant improvements can be expected for rulesets that were automatically generated by rule editing frontends.

Copyright (c) 2004-2006 Daniel Hartmeier <daniel@benzedrine.cx>. Permission to use, copy, modify, and distribute this documentation for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.



Comments
  1. By Anonymous Coward (62.225.37.69) on

    nice doc, thnx for the work.

    Comments
    1. By tobiasu (84.57.194.176) on

      > nice doc, thnx for the work.

      second that

    2. By sand (217.220.29.251) on

      > nice doc, thnx for the work.

      Yes, great article! Keep on! :)

    3. By Anonymous Coward (69.3.44.234) on

      > nice doc, thnx for the work.

      Yes, thanks for sharing this! I'm looking forward to the next two parts.


  2. By Henrik Gustafsson (85.224.65.86) on

    > The list of addresses can be expressed as a table:
    >  table  const { 10.1.2.3, 10.2.3.4, 10.3.4.5 }
    >  pass in from 
    
    I think a few <>s got lost somewhere :) Great article though. And most useful to me!

    Comments
  3. By Anonymous Coward (193.63.217.208) on

    Trivial nits picked in an excellent article:

    A reference to "This chapter" and the sentence "See the chapter 21 about how to identify and deal with such problems." should perhaps refer to this and a future article.

  4. By Marius (82.77.23.234) on

    Thank you, your article was very helpful.

  5. By Igor Sobrado (156.35.192.2) on

    Excellent work!

    Perhaps you should write a text, or PDF, document once the three parts of the chapter has been published and provide it as a part of the official documentation on pf. The notes on how optimization works on pf will make the first part of this chapter very valuable itself, and a good companion to the pf FAQ. There is a lot of useful information on it, and it is very easy to read.

    Just a though...

    Comments
    1. By Renaud Allard (85.201.63.39) on

      > Excellent work!
      >
      > Perhaps you should write a text, or PDF, document once the three parts of the chapter has been published and provide it as a part of the official documentation on pf. The notes on how optimization works on pf will make the first part of this chapter very valuable itself, and a good companion to the pf FAQ. There is a lot of useful information on it, and it is very easy to read.
      >
      > Just a though...

      I second that. The three articles should be published on openbsd.org website.

  6. By Anonymous Coward (24.86.207.147) on

    Awesome article.

  7. By Jason (142.179.203.230) on

    This potentially would have been the book coauthored with Jacek and mcbride@?

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      > This potentially would have been the book coauthored with Jacek and mcbride@?

      Yes. It was initially scheduled for release last year, but the editing process got stuck when Jacek stopped responding. All attempts to contact him have failed for almost an entire year now. O'Reilly has been very professional and patient about it, but finally cancelled the contract last month.

      We hope Jacek is alright and not having serious health problems (as has been the case in the past, unfortunately). If anyone knows about his whereabouts, please let me know.

      Comments
      1. By Anonymous Coward (24.34.57.27) on

        > > This potentially would have been the book coauthored with Jacek and mcbride@?
        >
        > Yes. It was initially scheduled for release last year, but the editing process got stuck when Jacek stopped responding. All attempts to contact him have failed for almost an entire year now. O'Reilly has been very professional and patient about it, but finally cancelled the contract last month.
        >
        > We hope Jacek is alright and not having serious health problems (as has been the case in the past, unfortunately). If anyone knows about his whereabouts, please let me know.
        >

        I would think O'Reilly would have contact information? I mean, going into a contract with him, and all.

      2. By John Bacall (172.129.89.186) on

        > Yes. It was initially scheduled for release last year, but the editing

        I really would love to see a pf book by you, Daniel. Thanks for the series of articles in the interim.

  8. By Pablo Méndez Hernández (213.0.43.166) pablomh@gmail.com on

    Hi Daniel:

    Based on your following two comments, I'd like to ask you which of the architectures supported by OpenBSD are better focused as high performance firewalls:

    - "i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints"

    - "...and the way the amount of physical RAM affects the amount available to the kernel depends on architecture and kernel options and version. As of OpenBSD 3.6, an i386 kernel can use up to 256MB of memory"


    Thanks in advance.

    Comments
    1. By petruha (86.57.0.25) on http://petruha.bsd.lv/

      > - "i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints"
      Speaking of which, how do other architectures compare to i386? IIRC, it was mentioned in misc@ that amd64 also is not very good for high PPS.

    2. By sthen (81.168.66.230) on

      > - "i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints"

      that's the point when you get to choose between things like these:

      - change to a nic with good interrupt mitigation, maybe sk(4)
      - start using jumbo frames if applicable
      - add more boxes in parallel (between multipath, ospf and carp load-balancing there's a lot you can do here)
      - implement (or pay someone to implement) polling and turn off the interrupts

      Comments
      1. By Pablo Méndez Hernández (213.0.43.166) pablomh@gmail.com on

        I was asking more on the line of:

        - Use sparc64/amd64/... arch, who from an architecture point of view are better dealing with interrupts.

  9. By Anonymous Coward (207.157.220.40) on

    i'm sure this it the wrong arena for feature suggestions .. but i don't spend a lot of time playing with openbsd anymore (unfortunately).

    regardless, perhaps a configuration option in 'pf.conf' is in order for the automatic optimization of the rules (i.e., the effect of 'pfctl -o' or 'pfctl -oo')? i assume that there are many who order their rules a specific way and have good reason to, but i imagine there are a lot that could also just trust pfctl to re-order them automatically. or even periodically (based on the counters).

    again, not asking for the feature for myself .. just a thought i had :) really glad to see danial still kicking ass with pf though .. cheers

  10. By Daniel Ouellet (66.63.10.94) daniel@presscom.net on

    Great paper Daniel!

    This is only equal to the same quality of PF itself!

    It's really to bad that the book got cancel as I have to say judging by this article, the book would have been a MUST HAVE!

    I love this type of details, example and explications of what's done, how and why including optimization of it, showing the results, the impact of it, how to improve them and also the pitfall not to fall into!

    Not sure why the book got cancel, but that's a sad lost for all of us and you to I am sure! No questions that it must have been a HUGE amount of work for sure!

    Hopefully it might come back to life under a different publisher, but I sure can only tell you that it is a great piece of work equal to PF itself!

    Many thanks Daniel to share your work and inside to the hart of PF like this!

    We don't see this type of quality to often sadly!

    Best regards,

    Daniel

    Comments
    1. By steve szmidt (71.98.138.249) on

      > Great paper Daniel!
      >
      > This is only equal to the same quality of PF itself!
      >
      > It's really to bad that the book got cancel as I have to say judging by this article, the book would have been a MUST HAVE!

      I'm sure all would concur very strongly! A MUST HAVE!

    2. By Anonymous Coward (124.104.46.67) on

      You said "When you see a rule near the top that is evaluated a lot but rarely matches, you can move it further down in the ruleset."

      I did execute "pfctl -vsr". How to identify if it matches?

  11. By Venture37 (217.160.111.190) venture37 # hotmail DOT com on www.geeklan.co.uk

    Daniel, I though your site was ciscosux.com ;)

  12. By Matt Van Mater (69.255.1.181) on

    An excellent read, well done as always.

    One question though: You mention the x86 architecture has a limit of about 10,000 interrupts per second. Do you literally mean all x86 platforms? I was wondering if something like a PCI-X bus would allow performance increases in the number of interrupts per second. I'm curious to know where that constraint comes from.

    I know this was written a while back before the new PCI-X bus was widely available and am curious if that part still holds true.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      > One question though: You mention the x86 architecture has a limit of about 10,000 interrupts per second. Do you literally mean all x86 platforms? I was wondering if something like a PCI-X bus would allow performance increases in the number of interrupts per second. I'm curious to know where that constraint comes from.

      As I understand it (someone correct me if I'm wrong), the bus used does not increase the rate at which the CPU can handle interrupts.

      The point of that section wasn't to imply non-i386 should be used as architecture (that would be quite theoretical, as certainly most people do run i386), but to understand the interrupt limit and invest in NICs that can bundle multiple packets into one interrupt. Most gigabit NICs do that. So, even if you filter only 100 Mbps links, getting a gigabit NIC can be worth the money because of this feature.

      If you're getting a PCI-X NIC, I suspect it will be a comparatively expensive gigabit NIC (compared to an PCI 100 MBps NIC) capable of interrupt bundling, so it will perform better, but not by increasing the interrupt rate limit, but by generating less interrupts for the same number of packets. If PCI-X itself guarantees that (like, by generating less interrupts compared to PCI), I don't know about that, maybe one of the hardware hackers can comment ;)

      Comments
      1. By Martin Toft (130.225.243.84) mt@martintoft.dk on

        > The point of that section wasn't to imply non-i386 should be used as
        > architecture (that would be quite theoretical, as certainly most
        > people do run i386), but to understand the interrupt limit and invest
        > in NICs that can bundle multiple packets into one interrupt. Most
        > gigabit NICs do that. So, even if you filter only 100 Mbps links,
        > getting a gigabit NIC can be worth the money because of this feature.

        Having browsed the market intensively for cheap gigabit NICs, D-Link DGE-530T looks promising (supported by sk(4)). Can you (or anobody else reading this) confirm that this card supports interrupt bundling?

        Should I look for any magic feature names in the NICs specifications to determine this?

        D-Link's page for the NIC: http://tinyurl.com/fr5xw

        Comments
        1. By Anonymous Coward (151.188.247.104) on

          > > The point of that section wasn't to imply non-i386 should be used as
          > > architecture (that would be quite theoretical, as certainly most
          > > people do run i386), but to understand the interrupt limit and invest
          > > in NICs that can bundle multiple packets into one interrupt. Most
          > > gigabit NICs do that. So, even if you filter only 100 Mbps links,
          > > getting a gigabit NIC can be worth the money because of this feature.
          >
          > Having browsed the market intensively for cheap gigabit NICs, D-Link DGE-530T looks promising (supported by sk(4)). Can you (or anobody else reading this) confirm that this card supports interrupt bundling?
          >
          > Should I look for any magic feature names in the NICs specifications to determine this?
          >
          > D-Link's page for the NIC: http://tinyurl.com/fr5xw


          After what D-Link just did to Harald Welte and the other contributors to the netfilter code (Welte had to sue them because D-Link decided that the GPL "is not legally binding"), I wouldn't buy anything by D-Link. Thankfully, German courts don't seem to be bought and paid for as much as some of the ones here in the USA.

          http://gpl-violations.org/news/20060922-dlink-judgement_frankfurt.html

          Even if it's a good product, I wouldn't give them my money, same as I don't for Broadcom or (these days) Apple. And it has nothing to do with "the GPL r0xx/sUxx" or any other such flamebait. It's about D-Link's refusal to follow someone else's license just because Welte is a "little guy." Not cool.

  13. By Daniel Melameth (63.227.26.190) daniel@melameth.com on

    Daniel,

    A big thank you for yet another piece of "PF gold"...

    Cheers.

  14. By Massimo Lusetti (83.211.174.86) on

    Thanks Daniel, great news this one!

  15. By J. (194.109.22.149) on

    Very interesting, thanks for sharing Daniel. On OpenBSD 3.9, I tried to optimize my ruleset using pfctl -o however oddly enough at some point I get syntax errors. When I load my current ruleset, it works perfecly fine. The optimized one doesn't. The rules which don't load all contain keep state probability 65% (if-bound) queue(std_out, std_ack) while other rules without any probability defined are loaded correctly.

    (Obviously, I can't put my ruleset here.)

    Comments
    1. By Nate (65.95.229.97) on

      > Very interesting, thanks for sharing Daniel. On OpenBSD 3.9, I tried to optimize my ruleset using pfctl -o however oddly enough at some point I get syntax errors. When I load my current ruleset, it works perfecly fine. The optimized one doesn't. The rules which don't load all contain keep state probability 65% (if-bound) queue(std_out, std_ack) while other rules without any probability defined are loaded correctly.
      >
      > (Obviously, I can't put my ruleset here.)


      Well, that's what pastebin is for.

      Comments
      1. By J. (194.109.22.149) on

        > > Very interesting, thanks for sharing Daniel. On OpenBSD 3.9, I tried to optimize my ruleset using pfctl -o however oddly enough at some point I get syntax errors. When I load my current ruleset, it works perfecly fine. The optimized one doesn't. The rules which don't load all contain keep state probability 65% (if-bound) queue(std_out, std_ack) while other rules without any probability defined are loaded correctly.
        > >
        > > (Obviously, I can't put my ruleset here.)
        >
        >
        > Well, that's what pastebin is for.

        Well, glad such services exist but I don't want my ruleset to be public. I'd e-mail it so dhartmei in private using GPG or sth though, if the previously mentioned does not suffice.

        Comments
        1. By sthen (82.153.166.138) on

          > Well, glad such services exist but I don't want my ruleset to be public.

          Your _actual_ ruleset doesn't matter, develop a minimal test-case that's just enough to demonstrate the problem, document it and use sendbug...

  16. By Ray Lai (199.67.138.42) ray@ on http://cyth.net/~ray/

    Basically, you have to increase the limit and verify the host remains stable after reaching that limit, by artificially creating as many entries.

    I think you meant, “by artificially creating many entries.”

    For a list of 1,000 addresses, the difference will be factor 1,000.

    “the difference will be a factor of 1,000.”

    Cool stuff, I would have liked to see it in print form.

  17. By Darren Spruell (206.132.94.6) on

    Daniel, brilliant submission, thank you. I just went through and applied some suggestions from this writeup and let pfctl's optimizer take a turn as well. The ruleset cleanup was great, and I can see the organization itself (rule ordering within the ruleset) makes much more sense now.

    Take care.

    DS

  18. By Chl (195.68.31.231) on

    Trivial typo :
    "when the next rule does not use the exact sime criterion"
    =>
    "when the next rule does not use the exact same criterion"

  19. By Igor Sobrado (156.35.192.2) on

    I am trying to understand a very annoying issue related with the document on ruleset optimization. From this document I understand that the network interface is the first parameter to be evaluated -- it makes sense, as grouping all rules related with a given interface will simplify stepping to the next set of rules (related with another network interface) if the packet analyzed is not arriving/being send on that interface. But, when looking at how pf optimizes rules in OpenBSD 3.7 (warning!!! it is probably a bit old!!!) I see this one:

    $ pfctl -nvvof example.fw
    set block-policy return
    set skip on { lo0 }
    @0 scrub in all fragment reassemble
    @1 block return all
    @2 pass out quick on fxp0 proto tcp from any to any port = ssh flags S/SA keep state
    @3 pass out quick on fxp0 proto tcp from any to any port = domain flags S/SA keep state
    @4 pass out quick on fxp1 proto tcp from any to any port = ssh flags S/SA keep state
    @5 pass out quick on fxp1 proto tcp from any to any port = domain flags S/SA keep state
    @6 pass out quick proto tcp from any port > 49151 to any flags S/SA keep state
    @7 pass out quick on fxp0 proto udp from any to any port = domain keep state
    @8 pass out quick on fxp1 proto udp from any to any port = domain keep state
    @9 pass out quick proto udp from any port > 49151 to any keep state
    @10 pass in quick on fxp0 proto tcp from any to any port = ssh flags S/SA keep state
    @11 pass in quick on fxp1 proto tcp from any to any port = ssh flags S/SA keep state

    This means, that rules must be written in this order:

    set block-policy return
    set skip on { lo0 }

    scrub in all

    block all

    pass out quick on fxp0 proto tcp to port ssh flags S/SA keep state
    pass out quick on fxp0 proto tcp to port domain flags S/SA keep state
    pass out quick on fxp1 proto tcp to port ssh flags S/SA keep state
    pass out quick on fxp1 proto tcp to port domain flags S/SA keep state

    pass out quick proto tcp from port > 49151 flags S/SA keep state

    pass out quick on fxp0 proto udp to port domain keep state
    pass out quick on fxp1 proto udp to port domain keep state


    pass out quick proto udp from port > 49151 keep state

    pass in quick on fxp0 proto tcp to port ssh flags S/SA keep state
    pass in quick on fxp1 proto tcp to port ssh flags S/SA keep state

    (I would expect all rules for fxp0 grouped together, all rules for fxp1 grouped together and the rules not related with a given interface being grouped together either at the beggining or at the end of the ruleset.)

    Am I missing something?

    Cheers,
    Igor.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      > I am trying to understand a very annoying issue related with the document on ruleset optimization. From this document I understand that the network interface is the first parameter to be evaluated -- it makes sense, as grouping all rules related with a given interface will simplify stepping to the next set of rules (related with another network interface) if the packet analyzed is not arriving/being send on that interface. But, when looking at how pf optimizes rules in OpenBSD 3.7 (warning!!! it is probably a bit old!!!) I see this one:

      It looks like the optimizer is correctly deciding that all your pass rules form a single re-sortable block, and sorts the rules within on

      1) direction (in/out)
      2) protocol (tcp/udp)
      3) interface (fxp0/fxp1/any)

      The sorting done by the optimizer is not in a strict order of fields (like the one the article suggests to do manually, i.e. first on interface, then on direction, then on protocol, etc.). Instead, it weighs the positions of fields according to the skip step values, and picks the combination with maximum sum.

      I think the theory is that this is statistically better, but I can't produce a convincing argument (or well-chosen example) why it is, indeed. You could produce a statistically representative sample of packets (like 50%:50% in vs. out, 50:50 tcp vs. udp, 30:30:30 fxp0/1/other) and count how many rules are skipped on average for the two cases (strict order vs. what pfctl -o does).

      The code for this is in /usr/src/sbin/pfctl/pfctl_optimize.c reorder_rules(), and if you find a statistically/mathematically better algorithm, I guess Mike Frantzen (frantzen@) would be to person to talk to about it.

      Comments
      1. By Igor Sobrado (156.35.192.4) on

        > It looks like the optimizer is correctly deciding that all your pass rules form a single re-sortable block, and sorts the rules within on
        >
        > 1) direction (in/out)
        > 2) protocol (tcp/udp)
        > 3) interface (fxp0/fxp1/any)
        >
        > The sorting done by the optimizer is not in a strict order of fields (like the one the article suggests to do manually, i.e. first on interface, then on direction, then on protocol, etc.). Instead, it weighs the positions of fields according to the skip step values, and picks the combination with maximum sum.

        Ok, thanks a lot for this explanation. It is very interesting and, certainly, not intuitive. I suppose that we should sort the rules as recommended by the optimizer except in the case we certainly found a better algorithm, am I wrong?

        For a third interface, with the rules sorted in this way:

        set block-policy return
        set skip on { lo0 }

        scrub in all

        block all

        pass out quick on fxp0 proto tcp to port ssh flags S/SA keep state
        pass out quick on fxp0 proto tcp to port domain flags S/SA keep state
        pass out quick on fxp1 proto tcp to port ssh flags S/SA keep state
        pass out quick on fxp1 proto tcp to port domain flags S/SA keep state
        pass out quick on fxp2 proto tcp to port ssh flags S/SA keep state
        pass out quick on fxp2 proto tcp to port domain flags S/SA keep state
        pass out quick proto tcp from port > 49151 flags S/SA keep state

        pass out quick on fxp0 proto udp to port domain keep state
        pass out quick on fxp1 proto udp to port domain keep state
        pass out quick on fxp2 proto udp to port domain keep state
        pass out quick proto udp from port > 49151 keep state

        pass in quick on fxp0 proto tcp to port ssh flags S/SA keep state
        pass in quick on fxp1 proto tcp to port ssh flags S/SA keep state
        pass in quick on fxp2 proto tcp to port ssh flags S/SA keep state

        (I have just added a third interface -fxp2- to the ruleset), the rules are sorted in a very different way:

        $ pfctl -nvvof tmp.txt
        set block-policy return
        set skip on { lo0 }
        @0 scrub in all fragment reassemble
        @1 block return all
        @2 pass out quick on fxp0 proto tcp from any to any port = ssh flags S/SA keep state
        @3 pass out quick on fxp1 proto tcp from any to any port = ssh flags S/SA keep state
        @4 pass out quick on fxp2 proto tcp from any to any port = ssh flags S/SA keep state
        @5 pass out quick on fxp0 proto tcp from any to any port = domain flags S/SA keep state
        @6 pass out quick on fxp1 proto tcp from any to any port = domain flags S/SA keep state
        @7 pass out quick on fxp2 proto tcp from any to any port = domain flags S/SA keep state
        @8 pass out quick proto tcp from any port > 49151 to any flags S/SA keep state
        @9 pass out quick on fxp0 proto udp from any to any port = domain keep state
        @10 pass out quick on fxp1 proto udp from any to any port = domain keep state
        @11 pass out quick on fxp2 proto udp from any to any port = domain keep state
        @12 pass out quick proto udp from any port > 49151 to any keep state
        @13 pass in quick on fxp0 proto tcp from any to any port = ssh flags S/SA keep state
        @14 pass in quick on fxp1 proto tcp from any to any port = ssh flags S/SA keep state
        @15 pass in quick on fxp2 proto tcp from any to any port = ssh flags S/SA keep state

        I moved that ruleset to a computer that runs OpenBSD 3.9 at my home, and rules are sorted in the same way. It seems a stable algorithm that has not changed in the last OpenBSD releases.

        > I think the theory is that this is statistically better, but I can't produce a convincing argument (or well-chosen example) why it is, indeed. You could produce a statistically representative sample of packets (like 50%:50% in vs. out, 50:50 tcp vs. udp, 30:30:30 fxp0/1/other) and count how many rules are skipped on average for the two cases (strict order vs. what pfctl -o does).

        It is a very difficult task! The best algorithm is that one that better fits the most probable traffic distribution on *real* networks. Certainly studying different sources on multi-interface routers is required. I will try to work on this problem, it can be an excellent goal for a paper (I am thinking on the ACM Computer Communication Review, IEEE Computer Networks, or an ACM SIGCOMM right now). If I finally find something interesting we can write a paper (or, at least, I would be glad to add your name to the acknowledgments if you do not have time to work on this manuscript). But, as you say, we need a statistically representative sample of packets. A sample of packets that fits real traffic distribution on standard networks.

        > The code for this is in /usr/src/sbin/pfctl/pfctl_optimize.c reorder_rules(), and if you find a statistically/mathematically better algorithm, I guess Mike Frantzen (frantzen@) would be to person to talk to about it.

        I will do my best to get a good paper, I am not only a Ph.D. with a doctoral dissertation on theoretical Computer Science, but also a physics graduate. I should be able to make an acceptable statistical analysis of the traffic. An improved algorithm would be a good contribution to the OpenBSD pf. But I am sure the current algorithm is an excellent one.

  20. By Igor Sobrado (81.32.128.56) on

    Nearly an off-topic but... some days ago Ryan McBride announced that, based on a suggestion made by Daniel Hartmeier, "flags S/SA keep state" will be the default on pf from now (I suppose that we must wait until 4.1 to see this change on production machines, right? This change arrived too late for 4.0, though.) Very very good news, indeed! Now we can write single-line rules easily that will make rulesets more readable.

    But... can someone explain what will happen to UDP-related rules? Will these rules be now "keep state" by default? As a connectionless protocol, UDP has a simple header format that does not support connection establishment ("flags" makes no sense here), but... "keep state" is very useful in ruleset optimization on UDP too. Will UDP related rules be "keep state" by default? There is nothing on the original announcement that refers to changes in UDP rules behaviour.

    Cheers,
    Igor.

    Comments
    1. By Martin Gignac (64.228.225.54) martin.gignac@gmail.com on

      > But... can someone explain what will happen to UDP-related rules?
      > Will these rules be now "keep state" by default?

      Yes, "keep state" is added by default to the end of UDP and
      ICMP-centric rules (without the flags S/SA, of course)

      -Martin

  21. By Master (184.56.6.119) on

    Nice article; just found it...in 2015!!

    Still learned some stuff about pf and I've been using it for years; very well written...thank you.
