OpenBSD Journal

Developer Blog - otto@: statvfs and the state of large disk support

Contributed by dwc on from the size-matters dept.

Otto Moerbeek (otto@) writes:

Last week I committed statvfs(3) support to OpenBSD 4.3-current. This is another step in large disk support, and I thought it would be nice to give an overview of the current state of affairs.

Large disks are disks that have more than 2TB capacity. Originally we had the following limitations:

  • disklabels could only handle up to 2TB disks and partitions.
  • filesystems could only be 1TB in size.
  • the in-kernel buffer layer could only handle 32-bit disk addresses.
  • the SCSI layer did not fully support 64-bit disk sector addresses.

Step by step, all these barriers have been removed in OpenBSD 4.1, 4.2 and 4.3. FFS2 was introduced both in the GENERIC kernel and in userland, where various tools like dump(8) and fsck_ffs(8) manipulate on-disk data structures directly. The disklabel format has been adapted to allow for larger partitions and disks, and the kernel buffer layer and filesystem code have been changed to use 64-bit disk sector addresses. The SCSI layer has been changed to allow inquiry of large disks.

All this means we now support large disks, partitions and filesystems. The statvfs(3) commits were one more step: the code that retrieves disk usage and related statistics had to be adapted too. This is more involved than you'd think: struct statfs needed to be expanded to allow for the larger block and file counts, which in turn required some careful backward compatibility work. Because this was a bit tricky, it did not make OpenBSD 4.3, alas. With struct statfs extended, it has now become easy to support statvfs(3).

There are a few things to keep in mind when using large partitions and FFS2. In particular, checking a large filesystem requires a lot of memory. The largest factor is the number of inodes in the filesystem. The default block and fragment sizes cause a lot of inodes to be created; for large filesystems you want to enlarge both, so fewer inodes are created. Test things: you do not want to discover after the fact that you cannot repair a filesystem because fsck needs more than MAXDSIZE memory.

In the future, we would like to solve this problem by allowing some sort of background file system check.

Another thing to remember: the boot loaders and the install/upgrade kernel do not know FFS2. Do not use FFS2 for any filesystem touched by the install/upgrade process (e.g. /, /usr, /tmp and /var).

Also, not all controllers actually support large disks: ami(4) for example only allows logical volumes up to 2TB. This is a hardware restriction, not a driver restriction. Other hardware/driver combinations might have their own limitations.

Here are some dmesg lines, along with bioctl, disklabel and df output, from my test system:

arc0 at pci2 dev 14 function 0 "Areca ARC-1120" rev 0x00: irq 10
arc0: 8 ports, 256MB SDRAM, firmware V1.42 2006-10-13
scsibus0 at arc0: 16 targets
sd0 at scsibus0 targ 0 lun 0:  SCSI3 0/direct fixed
sd0: 4291533MB, 67449 cyl, 511 head, 255 sec, 512 bytes/sec, 8789059584 sec total
$ sudo bioctl -h arc0
Volume  Status               Size Device  
 arc0 0 Online               4.1T sd0     RAID5
      0 Online               699G 0:0.0   noencl 
      1 Online               699G 0:2.0   noencl 
      2 Online               699G 0:3.0   noencl 
      3 Online               699G 0:4.0   noencl 
      4 Online               699G 0:5.0   noencl 
      5 Online               699G 0:6.0   noencl 
      6 Online               699G 0:7.0   noencl 
$ sudo disklabel sd0
# Inside MBR partition 3: type A6 start 63 size 199114390
# /dev/rsd0c:
type: SCSI
disk: SCSI disk
label: ARC-1120-VOL#00 
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 547093
total sectors: 8789059584
rpm: 10000
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

16 partitions:
#                size           offset  fstype [fsize bsize  cpg]
  a:       8789059521               63  4.2BSD  65536 65536    1 
  c:       8789059584                0  unused      0     0      
$ df /big 
Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
/dev/sd0a   4390189376   9137728 4161542208     0%    /big

Thanks for the great work, Otto!



Comments
  1. By Matthew Dempsky (38.102.129.10) on

    Very cool. :-)

    After reading otto@'s comment about ami(4) and >2TB disks, I looked at the man page and saw no other mention of this. Will known limitations like these be documented once large disks are fully supported?

    Thanks!

    Comments
    1. By Otto Moerbeek (otto) on http://www.drijf.net

      > Very cool. :-)
      >
      > After reading otto@'s comment about ami(4) and >2TB disks, I looked at the man page and saw no other mention of this. Will known limitations like these be documented once large disks are fully supported?
      >
      > Thanks!

      Depends. In most cases, the data about the max size of a raid set should be documented by the vendor. Only if the driver poses special restrictions should it be documented in the driver man page, imo.

      Comments
      1. By Matthew Dempsky (69.232.203.114) on

        That's a fair response. I've just been spoiled by OpenBSD's documentation. :-)

  2. By Niall O'Higgins (69.12.154.240) niallo@niallohiggins.com on http://niallohiggins.com

    Thanks for the update Otto - I'm sure this will clarify the current situation for many people. I didn't realise we could have >2T file systems yet :-)

    Minor correction to the article - isn't statvfs a section 2 manual page?

    Comments
    1. By Otto Moerbeek (otto) on http://www.drijf.net

      > Thanks for the update Otto - I'm sure this will clarify the current situation for many people. I didn't realise we could have >2T file systems yet :-)
      >
      > Minor correction to the article - isn't statvfs a section 2 manual page?

      Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.

      Comments
      1. By Igor Sobrado (156.35.192.2) sobrado@ on

        > > Minor correction to the article - isn't statvfs a section 2 manual page?
        >
        > Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.

        I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).

        By the way, thanks a lot for your excellent work on supporting large filesystems! It is a great improvement and will become more and more important in the next years as the disk sizes grow.

        Comments
        1. By Anonymous Coward (69.12.154.240) on

          > > > Minor correction to the article - isn't statvfs a section 2 manual page?
          > >
          > > Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
          >
          > I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).

          Yes thats what I am referring to.

          Comments
          1. By Otto Moerbeek (otto) on http://www.drijf.net

            > > > > Minor correction to the article - isn't statvfs a section 2 manual page?
            > > >
            > > > Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
            > >
            > > I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).
            >
            > Yes thats what I am referring to.
            >

            Oh, but that has been fixed for a few days already.

  3. By Anonymous Coward (129.222.50.21) on

    Any news about volume manager support on OpenBSD ?

  4. By Anonymous Coward (217.19.26.102) on

    Cool to see devs explain the new tech on undeadly.org!
