Contributed by jason from the what-happened-to-self-healing-networks dept.
As I was sitting at my desk in mid-afternoon, I was surprised by an instant message from one of our media team members. This is out of the ordinary and generally means something is broken.
3:18:07 vince: http down?
3:18:37 jason: not that I'm aware of, but I noticed DNS was acting hokey. Why, is it down for you?
3:19:11 vince: yes. couldnt connect to dev
3:20:08 jason: I'll check it out
Years of networking experience have taught me to always start at the bottom [layer of the OSI model] and work upwards. This was no exception. A quick ping dev ruled out any problems with basic connectivity. I then opened a web browser and loaded a series of pages, both internal and external. For the most part, they timed out; however, a few tabs loaded slowly, while one loaded instantaneously (none of these sites were cached, so we can rule that out). Something was obviously amiss, and it was my job to track it down.
I was a DNS administrator in a previous life, so I'm wholly aware of the chaos that can be caused by a misbehaving nameserver. We use two OpenBSD 3.9 systems as our authoritative and resolving DNS servers, serving up both internal and external Bind views with the default chrooted named daemon. This configuration has served us well, and the servers nary skip a beat serving up hundreds of requests per second. I started out by running some queries against each server. The responses mirrored the activity we experienced during web browsing: often the queries would take 2-3 seconds to respond, occasionally they would respond immediately, and sometimes they would simply time out. This was consistent across both of the DNS servers, regardless of the query target (internal or external recursion).
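Dig reports the latency of each lookup in its statistics footer, which makes this kind of inconsistency easy to quantify. A minimal sketch, using a canned sample line rather than a live query:

```shell
# Sketch only: pull the latency out of dig's ";; Query time:" statistics
# line. The sample variable stands in for real dig output.
sample=';; Query time: 2873 msec'
printf '%s\n' "$sample" | awk -F': ' '/Query time/ {print $2}'
# prints: 2873 msec
```

Run in a loop against each nameserver (e.g. feeding it from `dig @ns1 somehost`, hostname hypothetical), this turns "sometimes slow" into numbers worth graphing.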
No changes had been made to any of the DNS zones or Bind configuration in weeks, so any sort of typo was ruled out as a cause. Given the inconsistent behavior, we were starting to consider a denial-of-service attack. I opened up an OpenSSH connection to the primary master nameserver and snooped around. Process lists (ps -ax), network status (netstat -i -I em0 1) and kernel activity (vmstat 1) reports all came back normal. Processor load was virtually nothing. And yet, the conditions worsened... and more users started calling.
Before we continue, I'd like to give a brief overview of our network design. When I took over the infrastructure two years ago, the networks were a mishmash of six isolated LAN segments, each with its own dedicated Cisco PIX connected to the ISP WAN. There was no traffic accounting or Quality-of-Service queuing to ensure each department received the bandwidth it needed to perform its tasks. The users are primarily developers and engineers, known to download large amounts of software (and streaming video) at their own discretion. It was also common practice to allow visiting clients and vendors to connect their laptops directly into the host network. On top of all this, the company was operating on a single T1 connection to the Internet. The frustration of daily user complaints over the network congestion soon turned to hope; hope that OpenBSD, PF, ALTQ and solid design fundamentals would ease my pain.
The CEO at our company is very accepting of open source software. It took very little convincing to get him to agree to a complete overhaul of our networks, starting with a pair of OpenBSD i386 firewalls. Each firewall contains a total of two external vlan(4) interfaces on em0 and 16 internal vlan interfaces on em1. The vlan interfaces also have a corresponding carp(4) interface which provides fail-over between the firewalls. Traffic states are synchronized thanks to pfsync(4), which is bound to sk0. All of the user networks are part of the "internal" interface group, allowing for easy policy-based filtering. Every network is allowed to reach the DMZ, but most of the networks are not allowed to route between themselves at all. This design effectively creates a number of developer "sandboxes" which we have much greater control over. While they continue to allow visitors inside their gated community, I can rest assured that any hazardous traffic will be isolated to their network, the DMZ, or the Internet. This encapsulation also simplifies any queuing structures that I wish to implement.
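The group-based policy described above condenses into only a few pf.conf rules. The following is a sketch only; the table contents, the DMZ network, and the rule details are assumptions for illustration, not the actual configuration:

```
# All internal vlans carry the "internal" interface group, so one rule
# covers every user network. Addresses below are hypothetical.
table <rfc1918> const { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }
dmz_net = "10.0.50.0/24"

block in on internal all                       # no routing between sandboxes
pass in on internal from any to $dmz_net       # every network may reach the DMZ
pass in on internal from any to ! <rfc1918>    # and the Internet
```

Because the policy keys on the interface group rather than on individual vlans, adding a seventeenth sandbox requires no new filter rules at all.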
As we return to our hero, we remember that the situation was getting dire. The pattern was repeatable, but it made little sense that traffic would flow fine at times, while at other times it would respond slowly or not at all. All of the servers and switches appeared to be operating normally, well within capacity. Even the traffic graphs created by symon revealed we were running at 50% of our 3-Mbps connection (we have since upgraded to a bonded T1 pair). All switch ports and servers are set to auto-negotiate, which ruled out a duplex mismatch as the culprit. My patience was running thin.
Up to this point, I had neglected to analyze any traffic on the wire, as it had appeared to be an application-layer problem. I decided to take a quick look at a tcpdump capture while monitoring the debugging output of named via syslog. Purely by chance, I chose to initiate the DNS query from one of the firewalls. What happened next came as a complete surprise.
While performing a tcpdump -ni em0 udp and host 10.0.0.1 and port 53 on the target nameserver, I issued a dig command from the firewall. To my disbelief, I saw nothing. Wait! There it is, three seconds later: the initial query from the firewall, and an instantaneous response from the nameserver. For some reason, packets leaving the firewall for the nameserver were being delayed. Immediately, I knew what was wrong.
# pfctl -s state | wc -l
    10000
# pfctl -s memory | grep states
states        hard limit    10000
Sure enough, I had left the default state limit intact. The firewalls each have 256MB of memory, but rarely use more than 50MB of it. I edited pf.conf to add set limit states 20000, and issued a pfctl -O -f /etc/pf.conf to load the new options. Almost magically, network activity returned to normal. I sat back in my chair, breathed a deep sigh, and took in a healthy swig of caffeinated goodness.
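The limit had been reached silently; nothing in the logs pointed at pf. A small watchdog along these lines (my own sketch, not something from the incident; the 90% threshold and the pfctl parsing shown in the comments are assumptions) could surface the condition before users do:

```shell
#!/bin/sh
# Hypothetical sketch: warn when the pf state table nears its hard limit.
# The 90% threshold is an arbitrary choice.
warn_if_near_limit() {
    cur=$1; lim=$2
    # awk handles the comparison; plain sh has no floating-point arithmetic
    awk -v cur="$cur" -v lim="$lim" 'BEGIN {
        if (cur >= lim * 0.9) print "WARN: " cur "/" lim " states in use"
        else                  print "OK: " cur "/" lim
    }'
}

# On a live firewall the inputs would come from pfctl, e.g.:
#   cur=$(pfctl -si | awk '/current entries/ {print $3}')
#   lim=$(pfctl -sm | awk '/states/ {print $4}')
warn_if_near_limit 10000 10000    # prints: WARN: 10000/10000 states in use
```

Dropped into cron on each firewall, a check like this would have turned a mysterious afternoon of DNS timeouts into a one-line alert.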
Leaving for the day, I made sure to brief our CEO on the day's misadventure.
5:22:03 bill: everything looks good now
5:23:45 jason: yeah, one of the admins tripped on a cable. problem solved.
5:23:59 bill: ok, thanks
5:24:17 jason: :)