OpenBSD Journal

Developer blog: dlg: arc(4)

Contributed by dlg on from the too-much-storage-is-never-enough dept.

A little under a month ago I committed the start of a driver called arc(4), intended to support the SATA RAID controllers produced by Areca Technology Corporation, armed only with a single controller, some minimal documentation, and the FreeBSD driver. Less than two weeks after that I had IO working on the controller, and as of a few hours ago, useful bioctl support. If you're after cheap hardware RAID, this controller is worth a look.

I can't really explain why I started working on this controller. I think I was looking for a distraction from mpi, which I'd been working on for way too long (I started mpi at the end of last year if that's any indication of why I might have been sick of it). I had stuck the arc into a sparc64 and added the pcidevs entries a while back, and then I'd just left it there, mocking me from the dmesg with its not configured line.

So I wrote a tiny stub for it and discovered some old cruft in the tree called if_arc. if_arc was a networking layer, but a totally unused one. The problem was that configuring an arc device in my kernel defined a symbol in the kernel configuration called "arc" (funnily enough), which was originally a flag used to enable the arc network layer. So creating and enabling my arc pci driver (which did nothing except attach) caused if_arc to be compiled, which in turn broke things. So I deleted the if_arc code and slacked some more.

Eventually I got sick of my stub driver doing nothing, so I emailed Areca asking for documentation. After a few rounds with people at Areca, they eventually gave me documentation from Intel for the IOP331 and related chipsets, and the source code for the FreeBSD driver. Both of these are publicly available, but they did offer to help me with any problems I was having. My conclusion is that they don't have any real hardware level documentation for their product.

Fortunately the initialisation and IO paths for these controllers are insanely simple, so it wasn't hard to figure out how they worked, even from FreeBSD code. You simply map the register space described in the IOP331 doco, submit a single command, and get back the operational characteristics of the controller, the most important being how many simultaneous commands the hardware can take at one time. After that you simply allocate enough memory for these commands and let the midlayer do SCSI commands.

The APIs in FreeBSD are quite different to OpenBSD's. It is definitely a lot easier to pull a driver from NetBSD than it is from anywhere else (and vice versa, as mpi has recently shown). The interface to the hardware (the PCI layer) is different, along with the SCSI midlayer, and the way DMA memory is dealt with. These three things make it hard to pull code from FreeBSD directly and safely.

So I started from scratch, but pulled in a lot of bits from mpi (which in turn pulled in bits from ami).

arc(4) has the shortest and most beautiful path for SCSI IO of any controller I have seen. They even managed to keep the scatter gather lists simple. You simply take a SCSI command from the midlayer, map the data buffer into DMAable memory, fill in the arc IO command with the appropriate values, write the SGL out, submit it to the hardware, and then wait for a response.

This is different from the other RAID drivers I have worked on, where the hardware doesn't actually understand SCSI commands. In those cases we've had to emulate SCSI inside the driver and translate those commands into appropriate commands that are specific to that HBA. The other HBAs also had a bad habit of complicating the command structures unnecessarily; the arc one is amazingly simple. If anyone out there wants to know the basics needed for a SCSI driver, I would definitely recommend reading the setup (arc_attach) and the IO path (arc_scsi_cmd, arc_scsi_cmd_done) in arc.c.

The fact that arc(4) is a RAID controller that understands pure SCSI commands is worth mentioning. More recently it seems that RAID vendors are starting to catch on that SCSI is the language of storage, so their HBAs are beginning to take more and more SCSI commands natively. However, the Areca controllers are already there. This is the first RAID controller I've worked on that uses SCSI commands 100% in the IO path. The closest I've seen is mfi(4), which takes SCSI commands for basically everything except reads and writes.

I would like to point out that even though the commit log says it took me two weeks to get IO going on the HBA, it would have taken a lot less time if I'd had disks to plug into it. Or maybe I'm a slacker, I'm not sure.

IO works disgustingly well now. Doing sequential IO from a simple two disk stripe I'm able to sustain about 150MBps. With a 4 disk RAID5 I was doing 230MBps of sequential reads. Unfortunately completely random IO doesn't hold up well, but I think that's more an issue with SATA than the HBA itself.

After getting IO going I shifted my focus onto bio(4) support. I wish I could say as many nice things about the interface for firmware commands on arc as I can about its IO path. Let's just say it is insane. To get a command to the hardware, you have to prepend it with a 5 (yes, that's five) byte header, and put a 1 byte checksum on the end. This ends up making alignment a pain in the arse, so you end up allocating buffers for the commands twice, and copying the data around 2 or 3 times. You have to post these messages in 124 byte chunks, and there is no mechanism to ensure that the reply you're pulling off the hardware is for the command you just sent. The code I've written to deal with this isn't something I'm proud of. However, it does work and it is fairly safe and robust. It's just a lot more pain and work than it should be. Don't use it as a reference.

If the interface wasn't bad enough, the messages used to query the disks and volumes don't map well onto the bio(4) interface that bioctl(8) uses. I still haven't decided whether that's a problem with the Areca firmware layout or with our bio layer.

The problem is bio was built around ami, and ami enumerates all its configured volumes and disks in a 0-based array with no gaps. So bioctl goes "hi, how many volumes have you got?" and ami replies "hey, I have 3 volumes". bioctl then goes "cool, give me volumes 0, 1, and 2", which ami is quite happy about.

arc, on the other hand, doesn't have a nice easy answer to the "how many volumes have you got?" question. Instead it knows it can have up to 16 volumes, which isn't quite the same. And on top of that, these volumes aren't sequentially listed from 0 to 15 in that space of 16 volumes; they can exist anywhere in it. When I was writing this code I had set the HBA up so the first volume was at 1, which made bioctl a bit unhappy.

There is a ton of code that works around this issue, and others like it. Again, this code isn't something I'm proud of, but it does work, and it seems to work quite well.

One thing to note is that there is an extra layer in the Areca firmware between the disks and the volumes. You group disks into raid sets, and then build volumes out of those raid sets. For example, I have my HBA set up so there are three disks configured into a single raid set. On top of that I have built 3 volumes of varying types, which is kinda cool, especially for testing bioctl.

Anyway, I know you're all keen to see it in action, so here's some gratuitous pasting. First, here's the bits from dmesg:

ppb3 at pci3 dev 8 function 0 "Intel IOP331 PCIX-PCIX" rev 0x0a
pci6 at ppb3 bus 6
arc0 at pci6 dev 14 function 0 "Areca ARC-1110" rev 0x00: irq 11
arc0: 4 SATA Ports, 128MB SDRAM, FW Version: V1.41 2006-5-24
scsibus3 at arc0: 16 targets
sd2 at scsibus3 targ 0 lun 0: <Areca, ARC-1110-VOL#00, R001> SCSI3 0/direct fixed
sd2: 122070MB, 43402 cyl, 12 head, 480 sec, 512 bytes/sec, 249999360 sec total
sd3 at scsibus3 targ 1 lun 0: <Areca, ARC-1110-VOL#01, R001> SCSI3 0/direct fixed
sd3: 122070MB, 43402 cyl, 12 head, 480 sec, 512 bytes/sec, 249999360 sec total
sd4 at scsibus3 targ 3 lun 0: <Areca, ARC-1110-VOL#02, R001> SCSI3 0/direct fixed
sd4: 122070MB, 43402 cyl, 12 head, 480 sec, 512 bytes/sec, 249999360 sec total

Here's the bioctl output when everything is cool:

loki@i386 man4$ sudo bioctl arc0
Password:
Volume  Status     Size           Device  
 arc0 0 Online       127999672320 sd2     RAID5
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 1 Online       127999672320 sd3     RAID0
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 2 Online       127999672320 sd4     RAID1
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>

Let's check if the alarm is enabled, and if so, disable it. I don't want to wake people up.

loki@i386 man4$ sudo bioctl -a get arc0
alarm is currently enabled
loki@i386 man4$ sudo bioctl -a disable arc0
loki@i386 man4$ sudo bioctl -a get arc0     
alarm is currently disabled

Let's pull a disk:

loki@i386 loki$ sudo bioctl arc0 
Volume  Status     Size           Device  
 arc0 0 Degraded     127999672320 sd2     RAID5
      0 Offline                 0 1:0.0   noencl <disk missing>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 1 Offline      127999672320 sd3     RAID0
      0 Offline                 0 1:0.0   noencl <disk missing>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 2 Degraded     127999672320 sd4     RAID1
      0 Offline                 0 1:0.0   noencl <disk missing>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>

Unfortunately the arc firmware isn't aware of the disk once it's been unplugged, so I have to fake one for the bioctl output. I put it on bus 1, which doesn't exist, so it won't be confused with a real disk that may be plugged in. I'm not sure if I should keep the "disk missing" bit though...

Let's plug the disk back in:

loki@i386 loki$ sudo bioctl arc0 
Volume  Status     Size           Device  
 arc0 0 Rebuild      127999672320 sd2     RAID5 1% done
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 1 Online       127999672320 sd3     RAID0
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 2 Degraded     127999672320 sd4     RAID1
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>

You can see it's put the disk back into the raid set and it's begun to rebuild onto it. Once sd2 finishes rebuilding it will move on to doing sd4. Since sd3 was a stripe, it was basically destroyed when the disk was unplugged. It becomes available again, but the data will be lost.

Let's skip forward to the rebuilding of sd4 a few hours later:

loki@i386 loki$ sudo bioctl arc0
Password:
Volume  Status     Size           Device  
 arc0 0 Online       127999672320 sd2     RAID5
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 1 Online       127999672320 sd3     RAID0
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>
 arc0 2 Rebuild      127999672320 sd4     RAID1 9% done
      0 Online       320072933376 0:0.0   noencl <ST3320620AS 3.AAD>
      1 Online       320072933376 0:1.0   noencl <ST3320620AS 3.AAD>
      2 Online       320072933376 0:2.0   noencl <ST3320620AS 3.AAC>

After that, it looks just like it did before I unplugged the disk.

There is actually a 4th disk hooked up to the HBA, but I haven't started work on adding hotspares yet. This won't be nice either, because bioctl is set up to query the hardware for extra disks in a fashion that doesn't suit arc, but I'll make something work.

Unfortunately I don't know anyone with this hardware, so if you do have this card, please do give this code a go and let me know what you think.

I would like to thank Kevin Reay for the controller, and Theo de Raadt for helping me out with gear. Mention must go to Erich Chen and Billion Wu at Areca for helping me out with my questions. I'd also like to thank Deanna Phillips, Marco Peereboom, and the guys at DiGiCOR Brisbane.



Comments
  1. By jorge (201.232.11.168) on

    Hi there I am not kernel programmer but I would like to know why the api is so different from FreeBSD and why they/you did that? just a quick tip, I've been google but nothing useful so far...
    Thanks in advance

    Comments
    1. By tedu (71.139.173.104) on

      > Hi there I am not kernel programmer but I would like to know why the api is so different from FreeBSD and why they/you did that? just a quick tip, I've been google but nothing useful so far...

      things change over time. freebsd changes it one way, netbsd or openbsd change it another way.

  2. By Anonymous Coward (70.27.15.123) on

    Where exactly are these controllers cheap? The only prices I found were much more expensive than LSI controllers.

    Comments
    1. By Anonymous Coward (81.173.31.9) on

      > Where exactly are these controllers cheap? The only prices I found were much more expensive than LSI controllers.

      http://www.webconnexxion.com

      Dutch company, ships tax free overseas. 1210 will set you back $368, with the benchmarks i've seen though i'd happily pay more for areca than LSI because they're faster.

      http://tweakers.net/reviews/557/23
      http://tweakers.net/reviews/557/24
      http://tweakers.net/reviews/557/25
      http://tweakers.net/reviews/557/26

      Comments
      1. By sthen (81.168.66.230) on

        > http://tweakers.net/reviews/557/23

        Old LSI 150-4 (66MHz bus, no BBU) compared with a new Areca (133MHz bus, BBU). That's fair, eh?...

        Still looks like the Areca is faster than the 300-8x, at least for seq RAID5 reads, but if they've gone to the trouble of benchmarking, they could at least get some more up-to-date cards to compare with...

    2. By Simon Dassow (213.128.132.194) janus (at) errornet (dot) de on http://janus.errornet.de

      > Where exactly are these controllers cheap? The only prices I found were much more expensive than LSI controllers.

      Besides Areca those chips are built into Tekram controllers...
      and they're cheap if you look at the models with more than a few channels.

      Comments
      1. By Anonymous Coward (81.173.31.9) on

        > > Where exactly are these controllers cheap? The only prices I found were much more expensive than LSI controllers.
        >
        > Besides Areca those chips are built into Tekram controllers...
        > and they're cheap if you look at the models with more than a few channels.

        I'm pretty sure Tekram are crappy cards that happen to use the areca chip, they are no substitute if your after areca performance.

        Comments
        1. By Brad (204.101.180.70) on

          > > Besides Areca those chips are built into Tekram controllers...
          > > and they're cheap if you look at the models with more than a few channels.
          >
          > I'm pretty sure Tekram are crappy cards that happen to use the areca chip, they are no substitute if your after areca performance.

          I could be wrong but it looks as if Tekram is just reselling Areca gear.

  3. By Anonymous Coward (198.208.251.24) on

    Thanks for the extraordinarily lengthy and informative entry!!
