Contributed by pitrh on from the tag the puffy dept.
I intended to start the hackathon by finishing off a diff to add hardware VLAN tagging/stripping support for VT6105M chips in vr(4) then moving on to something else. Although I'm not a kernel or hardware hacker, I already had some mostly working code, the data sheet and a test device. How long could this take?
The VT6105M is one of the last revisions of the reasonably simple VIA Rhine family of 10/100 Ethernet chips. It's used in, amongst other things, the PCEngines ALIX and Soekris net5501 devices. It's capable of doing 802.1Q VLAN tagging and untagging in hardware; however, OpenBSD's driver did not support that, and neither did any of the other BSDs.
Background
The VIA Rhine chips have a bunch of configuration registers to set up the chip, plus some "descriptors" representing the Ethernet frames being sent and received. Each descriptor is 16 bytes and contains a bunch of flags describing the packet (two 32-bit words) and two pointers, one to the Ethernet frame data and one to the next descriptor (another two 32-bit words). In the OpenBSD driver, the descriptors are arranged into two rings of 128 descriptors, one for transmit and one for receive, and the driver and the chip fill and empty the rings. There's a bit in each descriptor indicating whether the chip or the driver currently owns a given descriptor.
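The layout described above can be pictured with a small sketch. This is not the driver's actual code: the struct, field and function names here are made up for illustration, and byte-order conversion and DMA mapping are left out. Only the 16-byte size, the ring of 128 and the ownership bit value come from the article.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 128            /* descriptors per ring, as in the text */
#define DESC_OWN  0x80000000u    /* set = chip owns the descriptor */

/* One 16-byte descriptor: two flag words followed by two 32-bit pointers. */
struct desc {
	uint32_t status;  /* flags, including the ownership bit */
	uint32_t ctl;     /* more flags */
	uint32_t data;    /* bus address of the Ethernet frame */
	uint32_t next;    /* bus address of the next descriptor */
};

/* Link the descriptors into a ring; ring_base is the bus address of ring[0]. */
static void
ring_init(struct desc *ring, uint32_t ring_base)
{
	for (size_t i = 0; i < RING_SIZE; i++) {
		size_t n = (i + 1) % RING_SIZE;  /* last entry points back to the first */

		ring[i].status = 0;              /* driver owns every slot initially */
		ring[i].ctl = 0;
		ring[i].data = 0;
		ring[i].next = ring_base + (uint32_t)(n * sizeof(struct desc));
	}
}

/* Return nonzero if the chip currently owns this descriptor. */
static int
desc_chip_owned(const struct desc *d)
{
	return (d->status & DESC_OWN) != 0;
}
```

The driver fills a descriptor, sets the ownership bit to hand it to the chip, and later checks the same bit to see whether the chip has handed it back.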
Hardware VLAN tagging
Figuring out how to send VLAN tagged frames was relatively straightforward: the data sheet shows how to set the VLAN ID (and the related priority bits) by setting bits 16-28 in the first word of the TX descriptor ("TDES0"). Observing the emitted packets via tcpdump on another machine, I was puzzled to see that while they were indeed tagged, they were all in VLAN 0. This turned out to be due to this somewhat odd construction in the driver where it changed the ownership bit to turn it over to the hardware:
    #define VR_TXSTAT_OWN 0x80000000
    #define VR_TXOWN(x) x->vr_ptr->vr_status

    VR_TXOWN(cur_tx) = htole32(VR_TXSTAT_OWN);
which expands to:
    cur_tx->vr_ptr->vr_status = htole32(0x80000000);
The VLAN ID is in the same word as the owner bit, so that effectively zeroed my carefully populated VLAN bits. Whoops. After changing that to a bit operation I could see correctly tagged VLAN frames. Chris Cappuccio cleaned that up while reworking the driver so it was already fixed by the time of the hackathon.
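To make the difference between the assignment and the bit operation concrete, here is a minimal sketch. The TXSTAT_VLAN macro and both function names are hypothetical; the field position (bits 16-28) and the ownership bit follow the description above, and the htole32() conversion is omitted for brevity.

```c
#include <stdint.h>

#define TXSTAT_OWN     0x80000000u  /* ownership bit, same value as VR_TXSTAT_OWN */
#define TXSTAT_VLAN(x) (((uint32_t)(x) & 0x1fffu) << 16)  /* bits 16-28; hypothetical macro */

/* Buggy handover: plain assignment wipes everything else in the word. */
static uint32_t
handover_assign(uint32_t status)
{
	(void)status;           /* the previous contents are simply discarded */
	return TXSTAT_OWN;
}

/* Fixed handover: OR the bit in so the VLAN bits survive. */
static uint32_t
handover_or(uint32_t status)
{
	return status | TXSTAT_OWN;
}
```

With a status word prepared for VLAN 42, handover_assign() returns a word whose VLAN field is 0, while handover_or() keeps the 42 and adds the ownership bit.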
While figuring out from the data sheet how to transmit tagged packets was straightforward, figuring out how to receive them was not. There's a bit in the RX descriptor that tells you whether or not a given frame was tagged; however, there's nothing in the data sheet that describes where the VLAN ID is actually stored. Fortunately, the Linux driver already supported hardware VLAN tagging, and they had a nice comment describing where to find it:
 * If hardware VLAN tag extraction is enabled and the chip indicates a 802.1Q
 * packet, the extracted 802.1Q header (2 bytes TPID + 2 bytes TCI) is 4-byte
 * aligned following the CRC.
Why did they do this? My guess is that it's because they'd run out of spare bits in the RX descriptor. Why didn't they at least include this information in the data sheet? Beats me.
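Going only by that comment, pulling the tag out of the receive buffer might look something like the sketch below. The function name is invented, and two things are assumptions rather than facts from the data sheet: that the reported frame length includes the 4-byte CRC, and that the TPID and TCI are stored in network (big-endian) byte order.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Read the TCI that the chip appended after the frame.
 * buf:       start of the received frame
 * frame_len: frame length including the 4-byte CRC (assumed)
 * Returns the 16-bit TCI; the low 12 bits are the VLAN ID.
 */
static uint16_t
rx_vlan_tci(const uint8_t *buf, size_t frame_len)
{
	size_t off = (frame_len + 3) & ~(size_t)3;  /* round up to a 4-byte boundary */
	const uint8_t *hdr = buf + off;             /* 2 bytes TPID, then 2 bytes TCI */

	/* Assemble the TCI from two bytes, most significant first. */
	return (uint16_t)((hdr[2] << 8) | hdr[3]);
}
```

The caller has to make sure the receive buffer is large enough to hold the frame plus padding plus these four extra bytes.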
I'd done most of this before the hackathon, so after some cleanup I had working code, and a bit later after some feedback from various folks it was tidied up and committed. Job done.
A minor optimization
Well, not quite. While poking around in the guts of the driver, I noticed a small possible optimization in the transmit path: when the chip's queue is full, it'd try to add some more packets which would fail, but then poke the chip to tell it to start anyway, which was unnecessary since nothing had changed. Keeping a local counter of packets added to the queue allowed us to avoid a PCI bus write, which helped a little (about 0.5% lower CPU usage in my tests, which is admittedly within the margin of error). That went in too. It turns out FreeBSD already did something similar.
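The idea is simple enough to show in a toy model. Nothing here is taken from the driver: struct txq, tx_start() and the kicks counter (standing in for the register write that tells the chip to start transmitting) are all hypothetical.

```c
#include <stddef.h>

#define TXQ_SIZE 128

struct txq {
	size_t used;   /* descriptors currently handed to the chip */
	int    kicks;  /* stand-in for the PCI write that starts transmission */
};

/* Try to queue up to n packets; only poke the chip if something was added. */
static size_t
tx_start(struct txq *q, size_t n)
{
	size_t queued = 0;

	while (queued < n && q->used < TXQ_SIZE) {
		q->used++;      /* a real driver would fill a descriptor here */
		queued++;
	}
	if (queued > 0)
		q->kicks++;     /* skipped entirely when the ring was already full */
	return queued;
}
```

When the ring is already full, tx_start() returns 0 and the kick count stays where it was, which is the bus write being saved.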
A major optimization
The OpenBSD driver requests an interrupt for each packet transmitted or received. Interrupts are expensive, so this per-packet overhead is significant.
FreeBSD has implemented interrupt reduction on the transmit path: instead of requesting an interrupt for every packet, it requests one every eight packets by setting the "interrupt control" bit (TDES1 bit 23) only on every eighth packet. Chris had previously tried this and saw no improvement, but suggested that I have a try. I did, based on what FreeBSD did, and like Chris saw no change on my ALIX.
Being stubborn, I spent the next couple of days poking around in the driver, building and booting kernels, browsing the data sheet and running benchmarks. Around this time, I realised that my "baseline" numbers were from a kernel built without POOL_DEBUG turned on while the test kernels had it, which invalidated the comparison and caused me to re-run a number of tests.
Eventually, I noticed the following entry in the datasheet for TDES3, which is the pointer to the next descriptor in the ring:
Bit 0: TDCTL[0]. Interrupt Control.
0 = issue interrupt for this packet
1 = no interrupt generated
Wait, what? That seems a lot like the bit we're already using (TDES1 bit 23):
Bit 23: IC. Interrupt Control
0: No interrupt when Transmit OK
1: Interrupt when Transmit OK
Why are there two bits doing what seem to be the same thing (although in opposite directions) in one 128-bit entry? Beats me. And why is one of them in the low bits of a pointer? Beats me too (although since the descriptors are going to have to be aligned to at least a 4-byte boundary, I guess they can ignore the least significant bits in the address and get away with it).
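To make the two bits concrete, here is a rough sketch of marking transmit descriptors so that only every eighth one interrupts. The bit positions follow the two data sheet entries quoted above; the struct, macro and function names are invented for illustration and are not the driver's.

```c
#include <stdint.h>
#include <stddef.h>

#define TXCTL_INTR_REQ   0x00800000u  /* TDES1 bit 23: interrupt when transmit OK */
#define TXNEXT_INTR_OFF  0x00000001u  /* TDES3 bit 0: no interrupt generated */
#define TX_INTR_EVERY    8            /* request one interrupt per 8 packets */

struct txdesc {
	uint32_t status;  /* TDES0 */
	uint32_t ctl;     /* TDES1 */
	uint32_t data;    /* TDES2 */
	uint32_t next;    /* TDES3: aligned address, so the low bits are spare */
};

/* Mark packet number seq so that only every eighth one interrupts. */
static void
tx_set_intr(struct txdesc *d, unsigned int seq)
{
	if ((seq + 1) % TX_INTR_EVERY == 0) {
		d->ctl |= TXCTL_INTR_REQ;      /* ask for an interrupt... */
		d->next &= ~TXNEXT_INTR_OFF;   /* ...and do not suppress it */
	} else {
		d->ctl &= ~TXCTL_INTR_REQ;     /* no request... */
		d->next |= TXNEXT_INTR_OFF;    /* ...and explicitly disable */
	}
}

/* Recover the real next-descriptor address by masking off the flag bit. */
static uint32_t
tx_next_addr(const struct txdesc *d)
{
	return d->next & ~TXNEXT_INTR_OFF;
}
```

Because the flag shares a word with an address, anything that follows the next pointer has to mask the flag off first, as tx_next_addr() does.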
I changed the code to set this "interrupt disable" bit on most of the packets while keeping the "interrupt request" bit on every eighth packet and all of a sudden things started to look better! Here's a summary of what systat showed while pushing about 85Mbit/s from userspace on my ALIX, before:
    35.7%Int 37.2%Sys 0.8%Usr 0.0%Nic 26.4%Idle
    ||||||||||||||||||==================>
    Interrupts: 10555 total, 10323 vr0
    7215 IPKTS, 14423 OPKTS
and after:
    29.2%Int 33.1%Sys 0.0%Usr 0.0%Nic 37.7%Idle
    |||||||||||||||================
    Interrupts: 4599 total, 4370 vr0
    7204 IPKTS, 14403 OPKTS
And similarly for routing 85Mbit/s of TCP through it, before:
    66.2%Int 0.0%Sys 0.8%Usr 0.0%Nic 33.1%Idle
    |||||||||||||||||||||||||||||||||
    Interrupts: 18469 total, 10241 vr0, 8001 vr1
    11069 IPKTS, 11062 OPKTS
and after:
    45.7%Int 0.0%Sys 0.0%Usr 0.0%Nic 54.3%Idle
    |||||||||||||||||||||||
    Interrupts: 12012 total, 7709 vr0, 4072 vr1
    11011 IPKTS, 10992 OPKTS
A useful improvement: 15% to 30% reduction in CPU usage for the same workload. Since the change only affects the transmit path, the number of interrupts for the packets received (both the data and the TCP ACKs) is still the same.
I was unable to get more than about 85Mbit/s out of a single interface on my ALIX, although it'd happily route that. I was able to get 72Mbit/s from userspace out of each of two different interfaces, for a total of 144Mbit/s. Even so, freeing up the CPU for other things (such as running PF, since these devices are often used as firewalls) is still useful.
Receive-side interrupt mitigation
To mitigate the interrupts on the receive side the chip would normally have a "holdoff timer" that causes it to delay interrupting the CPU for some amount of time after a packet is received, in case more packets arrive shortly afterward. This does add some latency, but also reduces the interrupt overhead significantly. Unfortunately, as far as I can tell the VT6105M does not support this feature, and I spent the rest of the time at the hackathon fiddling around with the chip's programmable interval timer, trying unsuccessfully to provide some mitigation on the receive side.
Conclusion
And that's how I spent most of a week tweaking the driver for a 10-cent Ethernet chip.
As a complete kernel n00b, I found being able to get help from the folks who know this stuff extremely useful, and face-to-face has a lot less turnaround time than email. I'd like to thank brad, chris, dlg, jsing, mikeb and sthen for putting up with my questions (and blunders) with good humour.
I usually work on userspace software for fun or large scale systems for work so flipping individual bits on actual hardware was a change for me and quite interesting although, at times, frustrating. I'd like to thank the University of Otago and in particular Jim Cheetham for making it possible.
By Matthieu Herrb (mherrb) matthieu@openbsd.org on
Comments
By Janne Johansson (jj) on http://www.inet6.se
Yes, fixed. Thanks for pointing it out.
By Darren Tucker (dtucker) dtucker@openbsd.org on
The bit that the VT6105M needed is not in the data sheets for all chip revisions, and I don't have devices to test it on.
If you'd like to try it, compare baseline performance vs adding VR_Q_INTDISABLE to the vr_devices vr_quirks in /usr/src/sys/dev/pci/if_vr.c for your particular device. If you do, please let me know what the result was.