OpenBSD Journal

Developer blog: pedro

Contributed by Pedro Martelletto on from the filesystems whee dept.

Some notes regarding UFS2

As you may have noticed, I've been slowly adding UFS2 (which I will refer to as 'FFS2' from now on) bits to the tree. That has caused some confusion, so here are a few clarifying notes about it.

Unfortunately, having FFS2 in-tree doesn't yet mean you will be able to deploy that evil > 1TB plan you had in mind to store all your porn, er, I mean, data. What's being done right now is just basic instrumentation, which means the code will be there the day we decide to use it.

There are quite a few issues that need to be addressed before we can take full advantage of FFS2. The first and most urgent of them all is a fast replacement for fsck. There's absolutely no point in supporting insanely huge file systems that would take a month to fsck. Now that VFS seems to be stable, this is priority #1 for me.

During the last Hackathon some time was devoted to discussing how to tackle this problem. The alternatives are few and well known, one of them (softdep) having been in-tree for a while. However, it has corner cases, such as resource leaks in the form of unreclaimed inodes and blocks, so it doesn't really eliminate the need for fsck. A background 'mark and sweep' garbage collection daemon, run upon mount, was proposed as a possible solution and is currently under study. The main problem with such an approach is how to pace it with userland: checking whether a given resource is allocated, updating metadata and flushing it back to disk would have to be done in a way that leaves neither the disk nor the kernel in an inconsistent state, and that neither takes over the CPU nor crawls to death. But it definitely would be a very robust solution.
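As a rough illustration of the pacing problem, here is a hypothetical toy sketch in C (none of this is OpenBSD code; the names and sizes are made up): a sweep over a block allocation bitmap that reclaims unreferenced blocks in small chunks, yielding between chunks so it neither hogs the CPU nor crawls to death.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Toy mark-and-sweep pass over a block allocation bitmap.
 * The mark phase (not shown) would walk all inodes and set a bit in
 * 'marked' for every block still referenced; the sweep below frees
 * blocks that are allocated on disk but unreferenced.
 */
#define NBLOCKS (1 << 20)           /* blocks in this toy file system */
#define CHUNK   4096                /* blocks examined between yields */

static uint8_t ondisk[NBLOCKS / 8]; /* bits set by the allocator */
static uint8_t marked[NBLOCKS / 8]; /* bits set by the mark phase */

static uint64_t
sweep(void)
{
	uint64_t leaked = 0;

	for (uint32_t b = 0; b < NBLOCKS; b++) {
		int used = ondisk[b / 8] >> (b % 8) & 1;
		int live = marked[b / 8] >> (b % 8) & 1;

		if (used && !live) {
			/* allocated on disk but unreferenced: reclaim */
			ondisk[b / 8] &= ~(1 << (b % 8));
			leaked++;
		}
		/* pacing: yield between chunks so userland can breathe */
		if (b % CHUNK == CHUNK - 1)
			usleep(1000);
	}
	return leaked;
}
```

The hard part, of course, is choosing the chunk size and the sleep interval so the daemon finishes in reasonable time without starving everything else.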

The other pending issues for FFS2 are making the userland tools able to grok it and supporting larger disks. The former means teaching newfs, fsck, growfs and tunefs to operate on FFS2 file systems. The latter means bumping daddr_t to 64 bits and having disklabel cope with the changes. This is needed so disk blocks at offsets > 32 bits can be read in, kept in the buffer cache and dealt with as 'ordinary' buffers. Getting FFS2 in allows us to do that, as it puts an end to the chicken-and-egg problem concerning daddr_t and FFS, where the first wouldn't be bumped because the second didn't support it, and the second wouldn't be updated because the first hadn't been bumped.
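To illustrate why the daddr_t bump matters, here is a small hypothetical C example (the function names are made up for illustration): with 512-byte sectors, a signed 32-bit block number tops out at 2^31 sectors, i.e. 1 TiB of addressable disk, and anything beyond that wraps.

```c
#include <stdint.h>

#define DEV_BSIZE 512	/* bytes per disk sector */

/* Byte offset of a block, the way a 32-bit daddr_t would compute it. */
static int64_t
off32(int32_t blkno)
{
	return (int64_t)blkno * DEV_BSIZE;
}

/* Byte offset with a 64-bit daddr_t: no truncation. */
static int64_t
off64(int64_t blkno)
{
	return blkno * DEV_BSIZE;
}
```

A block number just past 2^31 keeps its value in the 64-bit version, while narrowing it to 32 bits wraps it negative, which is exactly why disklabel and the buffer cache need the wider type before > 1TB disks can work.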

Hopefully adding these bits will get some attention, and perhaps attract more people to join me in FS hacking, 'cause that'll be a lot of work.

(Comments are closed)

  1. By smith ( on

    can you explain more about the VFS?

    Thanks for all these blog like entries from the developers, it's really interesting.

    1. By pedro@ ( on

      Sure, do some read-up (in case you haven't) and feel free to mail me whatever questions you have.

  2. By Anonymous Coward ( on

    Perhaps it is a completely unrelated matter, but I'm a little worried about a background system that always keeps the disk spun up ... or worse, lets it spin down for a short time and then spins it up again. I find that even now disks will spin back up when a mere read is performed that should be in the cache, or sometimes for no reason at all that I can ascertain. It would be a real bonus, for laptops in particular, if such behavior could be avoided.

    1. By pedro@ ( on


      While it may have been implicit, would you mind explaining exactly why you think that's to be avoided, and how it could be achieved without compromising the file system's integrity?

      1. By Anonymous Coward ( on

        Hmm ... probably just poorly thought out commentary on my part -- my keyboard's fault -- too little resistance in the keys =)

        In the rambling that *might* someday be pertinent to your efforts, I guess I was just expressing my hope that any background 'mark and sweep' garbage collection daemons be intelligent enough to consider power-saving issues when running on battery power on a laptop.

        The other issues aren't filesystem related. I should have thought more before I typed so I won't bother to elaborate on them. I'm going to look at my 'data' now. Excuse me...

    2. By Fábio Olivé Leite ( on

      As I see it, the background daemon would be run just for filesystem recovery after a crash or a similar event that might leave the disk inconsistent. If a filesystem is unmounted correctly, there should be no need for the daemon to run. And even when recovery does have to be performed, the daemon would run only until the filesystem is clean. After that, the filesystem code should surely keep the filesystem consistent, or else it has bugs. Is that right?

      1. By pedro@ ( on

        Oi Fábio,

        Yes, that's correct.

  3. By Anonymous Coward ( on

    A background 'mark and sweep' garbage collection daemon to be run upon mount was proposed as a possible solution, and is currently under study. The main problem with such approach is how to pace it with userland. The process of checking if a given resource is allocated, updating metadata and flushing it back to disk would have to be done in a way not to leave the disk or the kernel in an inconsistent state, nor take over the CPU or crawl to death. But it definitely would be a very robust solution.

    Hasn't McKusick dealt with this with FFS snapshots and the new background fsck he put together for FreeBSD 5?

    1. By Michael Knudsen ( on

      This is one of the things that Pedro said was difficult to do right. Either you hog the entire system for a (relatively short) while, or you do it really slowly and it takes forever. Yes, there is a grey area in between, but finding the right shade and keeping it is far from trivial.

  4. By Antonios Anastasiadis ( on

    THANK YOU to everyone who takes the time to write down some of their thoughts in a blog for the rest of us users. They are very interesting and enlightening. It is really, really, really cool to read such stuff. Keep up the good work (and the blogging!)

  5. By ike ( on

    Hi Marco,

    Thanks for posting your thoughts about filesystem design on undeadly.

    I don't know where you're at in the design/strategy process you're going through, but I immediately thought of the following book (available as a PDF at this URL):

    "Practical File System Design with the Be File System"

    The first 50 or so pages of the PDF give a historical overview of a number of contemporary filesystem designs, with their respective goals clearly stated, and a frank discussion of the strong and weak points of each design.

    The rest of the book is about BeFS, which may or may not be as relevant. (I personally think it's an exquisite design, though not necessarily relevant with regard to the community needs of OpenBSD, or *BSD for that matter.)

    Noteworthy: the author discusses journaling, and alternatives, at length in a high-level (and enjoyable) manner. (It's a fun read if you dig filesystem and database design, man.)

    Just as an FYI, this seems to be the current place information is collected for 'Bigdisk' stuff on FreeBSD:

    Aside from that, good luck!


    1. By Anonymous Coward ( on

      Pedro != Marco, and things go downhill after that. Seriously people,
      do you think we don't do the research? Or the searches, or the
      background checks? Yes, it is nice to point out (yet again) the
      obvious resources, but really, why not just pick up a keyboard
      and start coding? :)

      1. By ike ( on

        Hey Pedro- sorry to mix up your name, mental typo.

        To the 'anonymous coward' who railed at me here:
        I was just trying to provide some resources that have been very useful to my thinking when dealing with related persistent data problems; sorry if that's a bother here.

  6. By Anonymous Coward ( on

    Is this the time to implement journaling in the filesystem instead of relying on fsck which scales linearly in time with filesystem size?

    1. By Fábio Olivé Leite ( on

      Nope! Journaling filesystems have traditionally been regarded as the wrong solution to the consistency-after-a-crash problem in BSD land. That's why we have soft dependencies (the softdep mount option), usually considered a more elegant solution.

      Heh, two weeks ago I saw a friend screw up his notebook's hard disk for the second time because of Linux's ReiserFS (which we ended up calling HellRaiserFS), and then I patted myself on the back again for switching from Linux to OpenBSD three years ago. OpenBSD may lack some "cool" features, but at least I can trust it with my data.

      And I speak as the person who actually got ReiserFS into Conectiva Linux (the largest Latin American Linux vendor, now part of Mandriva) six years ago. At the time I had a lot of faith in it. It eventually destroyed all of my faith, journal or no journal.

      On the other hand, my OpenBSD notebook locked up hard five or more times last week because of faulty memory (it's going back to the shop tomorrow), and even though I was compiling GENERIC, doing a make build or running large file transfers each time it locked up, I didn't lose any data. I did have some inodes pop up in lost+found once or twice, but I thank softdep for keeping my data safe even then.

      1. By Anonymous Coward ( on

        Why is journaling "traditionally" viewed as not being the answer? Data loss on crash is exactly the problem journaling is there to solve. Softdeps (which I use on my OpenBSD systems) seem fine while a system is running, but do nothing about the crushingly long fsck times. I only have a 250GB filesystem currently; UFS2 stretches into the terabytes.

        Your example of ReiserFS failing tells me that ReiserFS's journaling is not to be trusted (a poor implementation?), not that journaling itself is not to be trusted. If anyone can do journaling *right* and allow us to wave a far-from-fond farewell to fsck after a crash, then I think it is the OpenBSD team.

        1. By pedro@ ( on

          Perhaps my text wasn't very clear, but a new solution can be built upon softdep that won't require a 'traditional' fsck, even one run in the background (which still renders the machine unusable). There's no reason we have to choose between softdep + what Free/NetBSD call 'bgfsck' and a fully journaled FS.

          1. By pedro@ ( on

            However, as ought to have been implicit in the text above, we're still open to suggestions, so you (and everyone else) are welcome to submit ideas (especially in the form of code), even if they are about journaling or some other technology you'd like to propose; please do so with arguments.

            Or in a sentence: if you want journaling, tell us why you want it; otherwise we won't be able to evaluate (or refute) your proposal.

          2. By Anonymous Coward ( on

            "Data loss on crash is exactly the problem journalling is there to solve."

            No, it is not. It's really tiring seeing this slashdot tardspeak all the time. Journalling does not have magic powers to prevent you from losing data. All journalling does is allow you to replay actions in the log to get the filesystem back into a "consistent state". Consistent just means the filesystem is clean and has no weird inconsistencies, like a file with no links not being marked as free. It does nothing to ensure that you don't lose data.

            There are other ways to ensure you are in a consistent state besides replaying a journal, and they don't introduce the additional overhead of recording all operations in a journal. One is fsck, but it takes a long time before you can use your filesystem. Another is bgfsck, which takes even longer, but lets you use the filesystem while it's running.

            Just because these are the traditional approaches, doesn't mean there aren't other approaches that may be better.
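The replay mechanism described above can be sketched as a toy, hypothetical example in C (this is not any real file system's journal format): each transaction logs its metadata updates and ends with a commit record, and on recovery only fully committed transactions are applied. That restores consistency, but anything logged after the last commit is simply discarded, which is exactly why a clean replay does not imply no data was lost.

```c
#include <stdint.h>

/* Toy journal record: either a metadata write or a commit marker. */
enum { REC_WRITE, REC_COMMIT };

struct rec {
	int      type;
	uint32_t off;	/* metadata slot to update */
	uint32_t val;	/* new value for that slot */
};

/* Replay committed transactions from the log into the metadata array. */
static void
replay(const struct rec *log, int nrec, uint32_t *meta)
{
	int start = 0;

	for (int i = 0; i < nrec; i++) {
		if (log[i].type != REC_COMMIT)
			continue;
		/* commit seen: apply every write since the last commit */
		for (int j = start; j < i; j++)
			meta[log[j].off] = log[j].val;
		start = i + 1;
	}
	/* records after the last commit are discarded, not applied */
}
```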

            1. By Anonymous Coward ( on

              Oops, obviously that was meant to be a reply to, not pedro. It would be nice if we could see the text of the post we are replying to above this reply form...

        2. By Anonymous Coward ( on

          To do journaling right one needs hardware assist. Think NVRAM here. The one company I am aware of that got it right was NetApp. I have lost data on every Linux fs out there.

      2. By Anonymous Coward ( on

        I know I'm an ass for saying this, because I should just be "coding it myself", but wouldn't OpenBSD benefit from being able to read all the filesystems out there, even if it can't write them? I can think of numerous times I'd have loved to throw a disk from some random OS into my OBSD box to save the data off it, but it won't mount.

        Or is there a larger philosophical reason OBSD chooses not to even touch those filesystems?

        1. By Anonymous Coward ( on

          I never have this problem. What in the world, minus screwing around with hardware, are you doing that you need this feature?

          1. By Anonymous Coward ( on

            I could see dual-booting as a good reason to be able to read the filesystems. Writing to them involves support on some grander scale, but reading doesn't sound terribly out of line - especially since most of that code is out there, though GPL.

            1. By Anonymous Coward ( on

              Right. Screwing around with hardware.

        2. By tedu ( on

          the number of openbsd developers dual booting linux with 4 different filesystems is limited.

