Developer Blog: dlg: hacking on ami

Contributed by dlg on 2006-03-19 from the ramblings of a lunatic with a compiler dept.

I have a monster of a headache at the moment and no painkillers in the house to kill it with. With that in mind I think it would be a bad idea to keep cutting up ami right now, so instead I thought I'd try to explain what I've been doing.

mickey first brought ami into our tree about 5 years ago. It started life as a driver to support access to its logical disks, which is great, because that's what you buy the hardware to do. However, the way it does this is pretty cool. MegaRAID controllers aren't actually traditional scsi controllers, meaning that there is no scsi bus which volumes hang off, you don't send INQUIRY commands to the volumes, and you don't do scsi reads and writes to get data on and off them either. This is different to hardware like siop or isp or mpt. For those controllers you build scsi commands in memory and submit them to the controller to be sent on the bus and replied to by the devices on that bus. MegaRAID controllers just have logical disks and they have their own fairly simple command set. However, when you plug one into an OpenBSD box you'll see it attach scsi bits like this:

ami0 at pci5 dev 0 function 0 "AMI MegaRAID" rev 0x20: irq 10
ami0: Dell PERC3/DC, 64b/lhc, FW 198U, BIOS v3.35, 128MB RAM
ami0: 2 channels, 0 FC loops, 1 logical drives
scsibus0 at ami0: 40 targets
sd0 at scsibus0 targ 0 lun 0:  SCSI2 0/direct fixed
sd0: 34560MB, 34560 cyl, 64 head, 32 sec, 512 bytes/sec, 70778880 sec total

All that scsi stuff is emulated in the driver. Rather than create a new block device equivalent to sd and wd but just for ami, we emulate all the necessary scsi commands for treating its logical disks as block devices. It turns out that you only need to translate about 8 scsi commands into MegaRAID commands. Emulating those 8 commands works out to be a lot less code and work than supporting a new block device in OpenBSD. If you have a look at ami_scsi_cmd in ami.c you'll see a big switch statement that turns requests from the scsi midlayer into MegaRAID commands.
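To give a feel for that kind of switch, here is a minimal sketch of translating a scsi command block into a controller request. It is not the code from ami.c. The ctl_op values, struct ctl_cmd and emul_scsi_cmd are made up for illustration, and the set of opcodes handled is my own pick, not the actual list the driver emulates. Only the scsi opcode values and the READ(10)/WRITE(10) layout are standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard scsi opcodes. */
#define SCSI_TEST_UNIT_READY	0x00
#define SCSI_REQUEST_SENSE	0x03
#define SCSI_INQUIRY		0x12
#define SCSI_READ_CAPACITY	0x25
#define SCSI_READ_10		0x28
#define SCSI_WRITE_10		0x2a
#define SCSI_SYNCHRONIZE_CACHE	0x35

/* Hypothetical controller operations, NOT the real MegaRAID command set. */
enum ctl_op { CTL_NONE, CTL_READ, CTL_WRITE, CTL_FLUSH, CTL_UNSUPPORTED };

struct ctl_cmd {
	enum ctl_op	op;
	uint32_t	lba;	/* first block of the transfer */
	uint16_t	nblks;	/* number of blocks */
};

/* scsi command blocks store multi-byte fields big-endian. */
static uint32_t
be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t
be16(const uint8_t *p)
{
	return (uint16_t)(((unsigned int)p[0] << 8) | p[1]);
}

/*
 * Turn a 10-byte scsi command into a controller request.
 * CTL_NONE means the driver answers from its own data and
 * nothing needs to be sent to the hardware.
 */
enum ctl_op
emul_scsi_cmd(const uint8_t *cdb, struct ctl_cmd *cmd)
{
	assert(cdb != NULL && cmd != NULL);

	cmd->op = CTL_NONE;
	cmd->lba = 0;
	cmd->nblks = 0;

	switch (cdb[0]) {
	case SCSI_READ_10:
	case SCSI_WRITE_10:
		cmd->op = (cdb[0] == SCSI_READ_10) ? CTL_READ : CTL_WRITE;
		cmd->lba = be32(&cdb[2]);	/* bytes 2-5: block address */
		cmd->nblks = be16(&cdb[7]);	/* bytes 7-8: block count */
		break;
	case SCSI_SYNCHRONIZE_CACHE:
		cmd->op = CTL_FLUSH;
		break;
	case SCSI_TEST_UNIT_READY:
	case SCSI_REQUEST_SENSE:
	case SCSI_INQUIRY:
	case SCSI_READ_CAPACITY:
		/* Faked up by the driver, no hardware command needed. */
		break;
	default:
		cmd->op = CTL_UNSUPPORTED;
		break;
	}

	return cmd->op;
}
```

The point of the shape is that only the read, write and flush cases ever reach the controller. Everything else is either answered by the driver itself or rejected.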

Anyway, emulating a scsi device was about all ami did up until a year ago when marco@ decided to write a RAID management framework around ami. All of a sudden ami grew some new code paths that needed to issue MegaRAID commands. Once upon a time there were two paths that used megaraid commands: the ami attach routine (for querying the hardware) and the scsi emulation. marco's changes added support for the passthrough scsi bus and a series of ioctls for bioctl to query the controller with. Both of these paths used MegaRAID commands, and their addition caused some growth in the paths responsible for putting commands on the hardware and pulling them off.

This growth turned into some ugly and delicate code. The way the dmamaps were loaded for the logical and passthrough busses looked disgusting, and the error recovery was even worse for those commands. They were even more confused by the fact that the ioctl paths went right through the same code too. It worked, but it was delicate. On top of this we were being overzealous in our locking (which may affect throughput on some faster disks), we were potentially busywaiting on the hardware a lot, and the way we retry commands when the hardware was busy was limited to scsi commands. Oh, and I don't think we do enough to make sure the disks are synchronised on shutdown.

In the past week I've been cutting the code up to try and address these issues. I've been tightening up locking as I've been going. The big work has been splitting the code paths up so that each user of megaraid commands only has to deal with its own stuff.

Before, there was one big path spread across three functions, each of which had tests to see if the command was for scsi or generic buffers. Then it tried to submit it to the hardware. If that failed it did different things depending on whether it was scsi or not, or if it had memory associated with it or not. When it was done, the completion path was very oriented to the scsi commands as well and left the rest to fall through. The ioctl completion path was an ugly dance with wakeups that left the ccb off the free list and hanging around until the process woke up.

Now there is a tiny chunk of code that is only used to submit commands to the hardware. All the different paths are responsible for their setup and only their setup, and then they submit the command to the hardware. Each code path into the submission sets a callback for the command that's called when the hardware finishes with it. This completion function is called for both polled and async commands now, which is handy in the scsi paths. Instead of having a small number of large functions for putting commands on the hardware, now we have a set of small and simple functions that do the work specific to their callers.
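As a rough illustration of that structure (not the actual ami code), the sketch below has one small submit function, a per-command completion callback set by whoever built the command, and a polled path that ends up in the same callback as the async one. Every name here (struct ccb's fields, fake_hw, hw_submit, hw_complete, hw_poll, scsi_done, ioctl_done) is hypothetical, and the "hardware" is a single slot in memory.

```c
#include <assert.h>
#include <stddef.h>

struct ccb;
typedef void (*ccb_done_fn)(struct ccb *);

struct ccb {
	ccb_done_fn	 done;		/* set by the path that built the command */
	void		*cookie;	/* caller-private state for the callback */
	int		 status;	/* filled in on completion */
};

/* Stand-in for the controller: it can hold one command at a time. */
struct fake_hw {
	struct ccb	*inflight;
};

/* The only code that puts a command on the hardware. */
int
hw_submit(struct fake_hw *hw, struct ccb *ccb)
{
	assert(ccb->done != NULL);	/* every caller must set a callback */

	if (hw->inflight != NULL)
		return (-1);		/* hardware busy */
	hw->inflight = ccb;
	return (0);
}

/* Completion, as the interrupt handler would do it. Returns 1 if a command finished. */
int
hw_complete(struct fake_hw *hw, int status)
{
	struct ccb *ccb = hw->inflight;

	if (ccb == NULL)
		return (0);
	hw->inflight = NULL;
	ccb->status = status;
	ccb->done(ccb);			/* same callback for polled and async */
	return (1);
}

/* Polled submission: a real driver would spin on a status register here. */
int
hw_poll(struct fake_hw *hw, struct ccb *ccb)
{
	if (hw_submit(hw, ccb) != 0)
		return (-1);
	hw_complete(hw, 0);		/* the fake hardware finishes immediately */
	return (ccb->status);
}

/* Two callers, each with a completion that only knows about its own state. */
struct xs_state { int finished; };
struct ioctl_state { int woken; };

void
scsi_done(struct ccb *ccb)
{
	struct xs_state *xs = ccb->cookie;

	xs->finished = 1;
}

void
ioctl_done(struct ccb *ccb)
{
	struct ioctl_state *io = ccb->cookie;

	io->woken = 1;
}
```

With this layout the submit code never needs to ask what kind of command it is holding. A scsi path sets done to scsi_done, an ioctl path sets it to ioctl_done, and hw_submit and hw_complete stay the same for both.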

For example, the dmamap setup for the scsi commands was done in ami_cmd, which was called by everyone. Now that dmamap setup has been pushed back up into ami_scsi_cmd and ami_scsi_raw_cmd. The scsi timeout stuff that was handled in ami_start (which was only ever called by ami_cmd, which means everything calls it) is now split into ami_start_xs, which is shared by both ami_scsi_cmd and ami_scsi_raw_cmd.

The other big change is how commands are queued onto the hardware. Previously every time you submitted a command you busywaited for the hardware to be ready, and then submitted the command. If it was a scsi command, and the hardware still wasn't ready, we retried it out of a timeout, otherwise we just returned with an error. The new behaviour is that every command gets stuck on a queue, no matter if it's from an ioctl, a passthrough scsi command, or a logical scsi command. If the hardware is busy the first time we try to run a command, we simply retry the whole queue in a new timeout. This means we should spend less time busy waiting (which is admittedly very rare with megaraids), which will improve interactivity on the system.
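The new queueing behaviour can be sketched roughly like this. Again, this is an illustration and not the driver code: struct ctlr, struct cmd and the ctlr_* functions are invented names, the hardware is reduced to a count of free slots, and the retry_armed flag stands in for what would be a real kernel timeout (timeout_add(9) on OpenBSD).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cmd {
	struct cmd	*next;
	int		 id;
};

struct ctlr {
	struct cmd	*q_head;	/* commands waiting for the hardware */
	struct cmd	*q_tail;
	int		 hw_free;	/* free command slots on the fake hardware */
	int		 issued;	/* commands handed to the hardware so far */
	bool		 retry_armed;	/* stands in for a pending timeout */
};

/* Every command goes on the tail of the queue, whoever it came from. */
void
ctlr_enqueue(struct ctlr *sc, struct cmd *c)
{
	assert(c != NULL);

	c->next = NULL;
	if (sc->q_tail == NULL)
		sc->q_head = c;
	else
		sc->q_tail->next = c;
	sc->q_tail = c;
}

/* Push queued commands at the hardware until it is busy or the queue is empty. */
void
ctlr_runqueue(struct ctlr *sc)
{
	struct cmd *c;

	sc->retry_armed = false;
	while ((c = sc->q_head) != NULL) {
		if (sc->hw_free == 0) {
			/* Busy: leave everything queued and try again later. */
			sc->retry_armed = true;
			return;
		}
		sc->q_head = c->next;
		if (sc->q_head == NULL)
			sc->q_tail = NULL;
		sc->hw_free--;
		sc->issued++;
	}
}

/* Entry point shared by all the submitters. No busy waiting here. */
void
ctlr_start(struct ctlr *sc, struct cmd *c)
{
	ctlr_enqueue(sc, c);
	ctlr_runqueue(sc);
}

/* What the timeout handler would do when it fires. */
void
ctlr_retry_tick(struct ctlr *sc)
{
	if (sc->retry_armed)
		ctlr_runqueue(sc);
}
```

Because the retry runs the whole queue rather than a single command, ordering is preserved and no caller has to care whether its command went out immediately or on a later pass.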

So yeah, I tightened up some locking, split up and simplified the code paths for submitting commands, changed how commands are queued onto the hardware, and I got rid of some busy waits. I've still got some cleanup to do since I've been focussing mostly on the scsi paths. The ioctl paths need some attention and I have to shrink the ccbs since we don't use half of them anymore.

I have to say thanks to marco and theo for not totally freaking out when I started on these changes. Especially marco. These aren't trivial diffs and he's been remarkably good natured about them.


Comments

By Anonymous Coward (80.108.115.184) on 2006-03-19 16:34

I have to admit I don't understand much of what you write, but I hope you feel better soon!

By Jim (68.250.26.213) on 2006-03-19 19:12

I have several LSI MegaRAID controllers deployed in production machines. I've been swapping out Adaptec 2810s ever since they (Adaptec) decided not to support 'me' the customer, and secondly the developers' efforts to support them no matter what I said.

I'm very happy to see these improvements. It confirms my decision to migrate away from Adaptec.

Now where did I leave that purchasing card, I need to buy some more LSI to complement the donation I made to the project. :)

By Anonymous Coward (70.176.20.6) on 2006-03-20 05:15

dlg, you the man. It's good to see this kind of code improvement happening all the time. I also think it's cool to get a view into the background for these kinds of source changes. Thanks for taking the time to write.

DS
