OpenBSD Journal

mpi(4): the story of a driver

Contributed by dlg from the did-it-again-did-it-better dept.

Like marco says, mpi is a rewrite of the mpt driver, which supposedly supported LSI Logic controllers that use the Message Passing Interface for communication between the operating system and the hardware. A lot of people have asked me why I chose to rewrite the driver rather than improve the existing one. Actually, everyone except Theo has asked me that (seriously, everyone, even Marco asked), so this post is an attempt to explain why.

The mpt(4) driver is supposed to support LSI controllers that use "MPT-Fusion" technology, which is a surprisingly large family of cards that talk SCSI, Fibre Channel, and more recently Serial Attached SCSI. iSCSI variants exist, but I've never seen one so I'll conveniently ignore them for now. The cool thing about all this hardware is that the interface and messages are the same between each different flavor, because the devices you plug into them all talk SCSI commands. The hardware abstracts the differences between the buses away, so your driver just has to deal with straight SCSI commands. Technically, your driver doesn't have to know what type of controller it's driving; it can do SCSI IO as soon as you initialise the hardware. However, to perform well there is some configuration that you should do for each type of bus. Some of the controllers also have basic RAID support, which requires some extra handling as well.
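
To give you an idea, here's roughly what a SCSI IO request message looks like. This is a sketch, not the exact MPI wire format, and the field names are mine, but it shows the point: the same message, carrying a plain SCSI CDB, is what you post to the firmware whether the disk hangs off parallel SCSI, FC, or SAS.

    struct mpi_msg_scsi_io {
            u_int8_t        target;         /* device address on the "bus" */
            u_int8_t        bus;
            u_int8_t        chain_offset;
            u_int8_t        function;       /* MPI_FUNCTION_SCSI_IO_REQUEST */
            u_int8_t        cdb_length;
            u_int8_t        sense_buf_len;
            u_int8_t        reserved;
            u_int8_t        msg_flags;
            u_int32_t       msg_context;    /* driver cookie, echoed in the reply */
            u_int8_t        lun[8];
            u_int32_t       control;        /* data direction, tag type, etc */
            u_int8_t        cdb[16];        /* the SCSI command itself */
            u_int32_t       data_length;
            u_int32_t       sense_buf_low_addr;
            /* the scatter gather list for the data follows */
    } __packed;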

Despite all this cool hardware being available, mpt fell short of supporting it.

I got interested in the mpt driver when I got an LSI fibre channel HBA. I wanted to stick it in a sparc64 machine and drive an external RAID box with it, so I dropped the controller in the box and came up with my first problem: the driver wasn't in GENERIC on sparc64. So I enabled it, thinking it would be the same as my experience with ahc, i.e., it would just work and I could get a free commit. Oh, how wrong I was. mpt wasn't endian safe, so the driver was basically writing garbage to the controller and freaking out when it didn't work.

My intention was to fix this so I started by enabling all the debug output, capturing it, and then comparing it to the output generated by the same card on my one and only i386. This was when I discovered that my fibre channel controller didn't really work on i386 either. So I started to read the driver for real (as opposed to looking for the debug enabling bits) and got really scared.

mpt is about 10000 lines of spaghetti spread across 7 different files in our source tree. The majority of the functionality is split between two files: an operating system independent chunk and the OpenBSD specific chunk. The OS independent chunk deals with the low level stuff like initialising the hardware and getting commands on and off the firmware, but it avoids the APIs OpenBSD provides for low level hardware access by using a ton of macros and a portability layer implemented in the OS dependent chunk. Not only does this make the code larger, it also makes it harder to read, since all the functions and APIs used are specific to this driver. You constantly have to refer back to the portability goo to figure out what it's really doing. On top of this, the driver uses a horribly verbose set of defines for the hardware commands. Literally half the lines of code in mpt are in the hardware defines. They're hard to read as well, since the writers decided that typedefs are awesome.
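
To illustrate (the mpt macro names below are invented, but true to its style): every register access in the "portable" chunk goes through driver-private wrappers like mpt_write(mpt, off, val), so you're forever chasing definitions. A native driver can just use the bus_space API that every other driver in the tree uses. mpi does exactly that with a pair of trivial inlines, where sc_iot and sc_ioh are the bus_space tag and handle kept in the softc:

    static inline u_int32_t
    mpi_read(struct mpi_softc *sc, bus_size_t r)
    {
            return (bus_space_read_4(sc->sc_iot, sc->sc_ioh, r));
    }

    static inline void
    mpi_write(struct mpi_softc *sc, bus_size_t r, u_int32_t v)
    {
            bus_space_write_4(sc->sc_iot, sc->sc_ioh, r, v);
    }

Anyone who has read another OpenBSD driver can read that without a map.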

This is different to the code I'm used to dealing with. For example, the ami driver is about 3500 lines of code in 4 files, and it actually does more work than mpt, since it emulates SCSI and handles the bioctl RAID functionality. mpt's 10000 lines of code seem insane in comparison.

I read enough of the code to realise that it was written only with i386 in mind. It didn't do the required byteswapping to work on big endian archs, none of the hardware structures were packed so strict alignment archs were automatically broken, its use of the bus_dma API was too minimal for it to work on machines without cache coherency, and it wasn't 64bit clean. There was a ton of really "interesting" code all through it that just didn't make sense to me. My favourite "what the?" code in mpt is the scatter gather list loading, and mpt_check_xfer_settings().
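
For the record, here's a minimal sketch of what getting that right looks like. The structure and names are illustrative (the real scatter gather entries have more wrinkles), but it shows the habits mpt never picked up: packed hardware structures, explicit byte order, and addresses that don't assume 32 bits.

    /* packed, so strict alignment archs see the layout the firmware expects */
    struct mpi_sge {
            u_int32_t       sg_hdr;         /* flags and segment length */
            u_int32_t       sg_addr_lo;     /* low 32 bits of the DMA address */
            u_int32_t       sg_addr_hi;     /* high 32 bits: 64bit clean */
    } __packed;

    static void
    mpi_sge_fill(struct mpi_sge *sge, bus_addr_t dva, bus_size_t len,
        u_int32_t flags)
    {
            /* everything the firmware reads must be little endian */
            sge->sg_hdr = htole32(flags | (u_int32_t)len);
            sge->sg_addr_lo = htole32((u_int32_t)dva);
            sge->sg_addr_hi = htole32((u_int64_t)dva >> 32);
    }

The other habit is syncing DMA memory with bus_dmamap_sync() before the hardware looks at it, which is what machines without cache coherency care about; there's an example of that further down.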

I honestly tried to figure out why my FC controller didn't work on i386 at least, but the code proved impenetrable. So I left it till people started to complain that our RAID support on these controllers sucked and that SAS was totally unsupported.

Unfortunately, the fact that the SCSI, FC, and SAS controllers are basically the same seems to have been ignored by the mpt driver. If you look at the PCI attach glue in src/sys/dev/pci/mpt_pci.c you'll see a big table of PCI device ids with the type of bus hardcoded next to each device. Since mpt was written back when only the SCSI and FC variants were around, it's wired up to only deal with those two types of controller, and in my experience it only really dealt with SCSI. Another thing that the mpt driver was supposed to deal with, but didn't, was the RAID support. It detected that there were RAID volumes on the controller, but screwed up the fetching of the RAID info, which was treated as a fatal error, and nothing worked. Considering that I couldn't even get a "supported" FC controller working, I didn't really consider it possible that I could fix that either, or add SAS.
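
For contrast, here's roughly what the attach glue can look like when you trust the hardware's own abstraction. This is cut down (the real table is longer, and I may be misremembering the exact pcidevs names), but no bus type is wired up anywhere; the driver asks the firmware what's behind the chip during attach:

    static const struct pci_matchid mpi_pci_devices[] = {
            { PCI_VENDOR_SYMBIOS,   PCI_PRODUCT_SYMBIOS_1030 },     /* SCSI */
            { PCI_VENDOR_SYMBIOS,   PCI_PRODUCT_SYMBIOS_FC929 },    /* FC */
            { PCI_VENDOR_SYMBIOS,   PCI_PRODUCT_SYMBIOS_SAS1064 }   /* SAS */
    };

    int
    mpi_pci_match(struct device *parent, void *match, void *aux)
    {
            return (pci_matchbyid((struct pci_attach_args *)aux,
                mpi_pci_devices,
                sizeof(mpi_pci_devices) / sizeof(mpi_pci_devices[0])));
    }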

A lot of people started to ask about SAS support, especially after the release of the Sun x4100 and x4200 systems, and it got to a point where marco and I decided to have a go at fixing mpt again. We asked for some hardware for me to hack on, and I ended up with a couple of controllers and then not enough time to do anything useful with them. I kept poking around in mpt, but the code kept making me sad.

Marco wanted me to merge our driver with the FreeBSD driver, which had supposedly gained better SAS and RAID support. I don't actually know if this is true; I never tried running it on the hardware to see if it worked. I'm cynical because our driver supposedly supported FC and RAID too. Marco's reasoning was that we could get the functionality first and then start to clean up the driver.

However, despite the extra hardware support, my opinion was that the FreeBSD driver suffered from a lot of the same problems ours did, namely that it was huge and hard to read (in FreeBSD, mpt is about 20000 lines) and engineered heavily toward running on i386. On top of this you have a different API for working with the hardware, and our SCSI midlayers are completely different. Porting the driver would have been a significant effort. After the port we would have had a driver that worked on i386 and amd64, and supported SCSI, FC, and SAS.

Then after we ported it we intended to go through the code and rip it apart and put it back together to fit into how OpenBSD works, much like we did for ami. At the end of that I was hoping we'd have a lean driver that could work on multiple architectures. Following that we would add the bio parts to support the RAID functionality. To summarise, we were going to go through two and a half major code efforts to get mpt into shape.

I wasn't keen on doing all this work, so I decided to write my own driver from scratch and get it right the first time round, i.e., only one major code effort. Thus, mpi(4) was born. I started work on the new driver toward the end of last year and kept tinkering with it. Unfortunately I didn't have much time to spend on it, so progress was slow, but it was enough to keep me motivated. I did the initial work on a sparc64. By the time I got to Canada for the hackathon I had the initialisation and low level hardware access routines done and working on both i386 and sparc64. Like I said, this wasn't much, but it was enough to make me think it was a viable start to an alternative to mpt.

Just before the hackathon was when I got serious. I spent the week before it writing the asynchronous command handling paths (basically all SCSI IO uses these) and the glue between that and our SCSI midlayer. The day before the hackathon I had SCSI commands working during autoconf, so I was able to attach disks that were on the SCSI bus, but reads and writes to the disks weren't working.
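
The midlayer glue itself is small. As a sketch (assuming the scsi_adapter interface as it stood at the time, with all the error handling left out), the driver hands the midlayer a switch of entry points, and every SCSI command arrives as a scsi_xfer for the driver to turn into an MPI message:

    int     mpi_scsi_cmd(struct scsi_xfer *);
    void    mpi_minphys(struct buf *);

    struct scsi_adapter mpi_switch = {
            mpi_scsi_cmd,   /* queue a SCSI command */
            mpi_minphys,    /* clamp transfer sizes */
            NULL,           /* open_target_lu: unused */
            NULL            /* close_target_lu: unused */
    };

    int
    mpi_scsi_cmd(struct scsi_xfer *xs)
    {
            /*
             * The asynchronous path: copy xs->cmd into the message's
             * CDB, load xs->data into a dma map, fill in the scatter
             * gather list, and post the message to the firmware.  The
             * interrupt handler completes the xfer when the reply
             * turns up on the reply queue.
             */
            return (SUCCESSFULLY_QUEUED);
    }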

I knew the code could work, but I couldn't see my bug, so I went public with my code and the fact that I wanted to replace mpt, not fix it. I mailed mpi around to krw, marco, and deraadt to see if they could figure out what I was doing wrong. Marco was not impressed and immediately replied with "port freebsd and then we'll fix it". Uncool.

Luckily we managed to talk Marco round. As soon as he read the new code the arguments for it fell into place, and he was then all for it. He spotted my bug (it was a four line diff, I was so pissed off) and all of a sudden mpi was doing IO. That night was the first night of the hackathon, so the first day of the hackathon was the first day that mpi could newfs and mount a filesystem. Not only on i386, but on sparc64 as well.

The day after, I imported mpi into the tree and started fixing issues. A lot of the hackathon was spent making the scatter gather lists work properly (something I still didn't get right till a week or two ago), starting on PPR, and cleaning up a few endian issues that I'd missed while working on i386 (blah). Over the course of the week I got to test the driver on a variety of hardware including a VIA C3 based system, a Sun v120, and an HP DL145. However, I have to say my favourite test machines were the Blade 1000 and the Sun x4100 (thanks Carson).
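
The scatter gather loading that gave me grief boils down to walking the segments of a loaded bus_dma map into the message's SGL. A simplified sketch, reusing the illustrative mpi_sge_fill() from earlier (the flag names are invented, and the chain entries needed for long lists, the part that actually took weeks, are left out):

    int
    mpi_load_sgl(struct mpi_softc *sc, struct mpi_ccb *ccb,
        struct scsi_xfer *xs, struct mpi_sge *sgl)
    {
            bus_dmamap_t dmap = ccb->ccb_dmamap;
            int i, error;

            error = bus_dmamap_load(sc->sc_dmat, dmap, xs->data,
                xs->datalen, NULL, BUS_DMA_NOWAIT);
            if (error != 0)
                    return (error);

            for (i = 0; i < dmap->dm_nsegs; i++) {
                    mpi_sge_fill(&sgl[i], dmap->dm_segs[i].ds_addr,
                        dmap->dm_segs[i].ds_len, MPI_SGE_SIMPLE);
            }
            /* flag the last entry so the firmware knows the list ends */
            sgl[dmap->dm_nsegs - 1].sg_hdr |= htole32(MPI_SGE_EOL);

            /* push the data at the hardware before posting the command */
            bus_dmamap_sync(sc->sc_dmat, dmap, 0, dmap->dm_mapsize,
                (xs->flags & SCSI_DATA_IN) ?
                BUS_DMASYNC_PREREAD : BUS_DMASYNC_PREWRITE);

            return (0);
    }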

The Blade 1000 was exciting because it was actually the first disk controller that the UltraSPARC III systems could boot and root off. Originally Jason and I tried it just to be smart arses when we figured out that the SAS controller I had had fcode on it, which meant that OpenFirmware could boot off it. Before mpi there was no support for the LSI SAS controllers at all, let alone support on sparc64. So we tried it and it worked.

The x4100 was fun because it was the first machine to be installed with mpi as the controller for the root disk, and it was the machine most people were asking about with regard to support for its storage controller. I'd accidentally left the debug output on in the bsd.rd image, which meant that every time you did disk IO you'd get about 12 lines of kernel output on the screen. It slowed things down, but it was fun when we booted into the new system and everything Just Worked(tm).

And they worked fast.

Anyway, that's enough from me. Sorry again. I'd just like to recap why I think mpi is better than mpt:

  • It's currently 3500 lines of code, compared to the 10000 for mpt, and the 20000 for mpt in FreeBSD. When I first got it going it was a mere 2100 lines of code; since then it's grown some code for dealing with PPR on SCSI controllers and a start on some RAID support. Lines of code is definitely an example of where less is more. Less code means fewer bugs, and fewer bugs means more happy for me and everyone trying to use the driver.
  • It's 64bit clean; mpt was written to only work on machines with 32bit physical addresses.
  • It uses structs that are safe for use on strict alignment architectures.
  • It uses the bus_dma API correctly, which makes it safe for use on machines without cache coherency.
  • It's endian safe, so I can actually use my FC controller on my sparc64s now (or after kettenis fixes interrupt mapping on it).
  • It's only 4 files, and all the real code is in only one of them. This code is readable.
  • It's bloody fast. I use as little locking as possible, and there are no busy waits in the IO paths.
  • I get to delete code from the repository, which is actually more fun than adding code. Trust me. Replacing the old ses driver with the new ses and safte drivers was awesome too.
  • People get to stop bitching about how OpenBSD doesn't run on new SAS based systems.
  • It supports SCSI, FC, SAS, and now VMware as well. Any new flavors should just work as well, assuming they don't deviate from the basic init and io messages that mpi uses.

I would like to thank Srebrenko Sehic, Shane Pearson, and Rene Badalassi for sending me hardware to develop on. Also to deraadt@, beck@, krw@, jason@, and marco@ for helping out with code and testing at the hackathon. It was fun.



Comments
  1. By Marc Balmer (157.161.101.131) on http://www.msys.ch/

    I have been running the new mpi(4) driver for some days and I must say it's really nicely done!

  2. By EN (83.248.138.152) en@openbsd.nu on http://www.openbsd.nu

    Great work!

    "If you want it right - you have to do it yourself!"

  3. By Anonymous Coward (204.244.250.2) on

    Does anyone have any performance metrics showing the speed delta between the two drivers?

    Comments
    1. By Anonymous Coward (198.208.251.24) on

      > Does anyone have any performance metrics showing the speed delta between the two drivers?

      Frankly, if mpi was slower, I would still use it anyway.

    2. By Bryan Inderhees (67.39.209.1) bpi+deadly@case.edu on

      By my reading of David's justification, you'd probably stumble into a divide-by-zero error while comparing the two...

  4. By Simon Dassow (213.39.205.12) janus (at) errornet (dot) de on http://janus.errornet.de

    Wow! What an interesting read about your impressive work.
    It shows the net effect of sane refactoring methods, I'd say :-)

    Thanks for sharing this nice kind of experience and the absorbing write up.

    Regards,
    Simon

  5. By Jason Wright (65.202.219.66) jason@openbsd.org on http://www.thought.net/jason

    Actually, it was a blade2000 that we were running with at the hackathon. Supporting mpi(thanks to dlg's work) was actually easier than supporting the onboard isp as a root device (that required all kindsa nasty autoconf goop to match the fibre channel WWPN given by the prom to the actual disk device). All we did to get mpi working was to increase a timeout value: everything just worked after that (why it takes slightly longer to initialize on the blade2k, I have no idea).

    Comments
    1. By David Gwynne (220.245.180.130) loki@animata.net on

      > Actually, it was a blade2000 that we were running with at the hackathon.

      crap, yeah. sorry.

    2. By David Gwynne (220.245.180.130) loki@animata.net on

      > All we did to get mpi working was to increase a timeout value: everything just worked after that (why it takes slightly longer to initialize on the blade2k, I have no idea).

      That's actually an issue with the SAS controllers, not with running mpi on sparc64 per se. Up until we dropped the SAS board into your blade2k I hadn't really tested my code on the SAS controllers, so I didn't really know if it would work or not.

      Comments
      1. By Jason Wright (24.254.95.239) jason@openbsd.org on http://www.thought.net/jason

        > That's actually an issue with the SAS controllers, not with running mpi on sparc64 per se. Up until we dropped the SAS board into your blade2k I hadn't really tested my code on the SAS controllers, so I didn't really know if it would work or not.

        SAS just takes longer to initialize? I think I'll stick to host bridges and network devices, at least the insanity is well understood =)

        Comments
        1. By David Gwynne (130.102.78.195) loki@animata.net on

          > SAS just takes longer to initialize? I think I'll stick to host bridges and network devices, at least the insanity is well understood =)

          Yeah, basically.

          Can you understand your gear better because you have doco?

  6. By Anonymous Coward (84.130.196.47) on

    Thanks a lot for your impressive work and this detailed write up!

  7. By Shane J Pearson (202.45.125.5) on

    > I would like to thank Srebrenko Sehic, Shane Pearson, and Rene Badalassi for sending me hardware to develop on.

    You are very welcome David. When I can spare the money or hardware, doing so is easy. The hard part and big effort is what you and the other OpenBSD devs do, which I am very thankful for. In return, I get back so much more than all I could ever give.

    So a big thanks goes to you and the rest of the team!

    Shane

  8. Well, what do you know, it "just works" on alpha too. :)

    Comments
    1. By David Gwynne (130.102.78.195) loki@animata.net on

      > Well, what do you know, it "just works" on alpha too. :)

      I dare someone to find me an architecture with PCI that mpi(4) doesn't work on.

      Comments
      1. By Miod Vallat (82.195.186.223) miod@ on

        > > Well, what do you know, it "just works" on alpha too. :)
        >
        > I dare someone to find me an architecture with PCI that mpi(4) doesn't work on.

        Did you try it on an O2 yet?

        Comments
        1. By Anonymous Coward (220.245.180.130) on

          > > > Well, what do you know, it "just works" on alpha too. :)
          > >
          > > I dare someone to find me an architecture with PCI that mpi(4) doesn't work on.
          >
          > Did you try it on an O2 yet?

          bsd.rd panics during the install, so no. i'd like to though.

          Comments
          1. By David Gwynne (220.245.180.130) loki@animata.net on

            > > Did you try it on an O2 yet?
            >
            > bsd.rd panics during the install, so no. i'd like to though.

            Well, I had to take half the memory out, but I managed to get it installed and building kernels.

            I enabled mpi, and it Just Worked(tm). It's kinda fun having an o2 with fibrechannel and a 130gig volume attached to it.
