Contributed by Pedro Martelletto on from the filesystems whee dept.
As you may have noticed, I've been slowly adding UFS2 (which I will
refer to as 'FFS2' from now on) bits to the tree. That has caused some
confusion, so here are a few clarifying notes about it.
Unfortunately, having FFS2 in-tree doesn't yet mean you will be able to deploy that evil > 1TB plan you had in mind to store all your porn, er, I mean, data. What's being done right now is just basic instrumentation, which means the code will be there the day we decide to use it.
There are quite a few issues that need to be addressed before we can
take full advantage of FFS2. The first and most urgent of them all is a fast replacement for fsck. There's absolutely no point in
supporting insanely huge file systems that will take a month to fsck.
Now that VFS seems to be stable, this is priority #1 for me.
During the last Hackathon some time was devoted to discussing how to tackle this problem. The alternatives are few and well known, and one of them (softdep) has been in-tree for a while. However, it has corner cases, such as resource leaks in the form of unreclaimed inodes and blocks, and thus doesn't really eliminate the need for fsck. A background 'mark and sweep' garbage collection daemon to be run upon mount was proposed as a possible solution, and is currently under study. The main problem with such an approach is how to pace it with userland. The process of checking whether a given resource is allocated, updating metadata and flushing it back to disk would have to be done in a way that neither leaves the disk or the kernel in an inconsistent state, nor takes over the CPU, nor slows to a crawl. But it would definitely be a very robust solution.
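To make the 'mark and sweep' idea concrete, here is a minimal sketch over a toy in-memory model. Everything in it (the Inode class, the mark and sweep functions, the bitmaps as Python sets, the batch size) is a hypothetical illustration and not the design under study: the mark phase walks the directory tree from the root, and the sweep phase reports whatever is allocated but unreachable, in small batches so a real daemon could yield to userland between them.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    blocks: list = field(default_factory=list)    # data block numbers
    children: list = field(default_factory=list)  # inode numbers (directories only)


def mark(inodes, root):
    """Walk the tree from root; return the sets of reachable inodes and blocks."""
    live_inodes, live_blocks = set(), set()
    stack = [root]
    while stack:
        ino = stack.pop()
        if ino in live_inodes:          # hard links: visit each inode once
            continue
        live_inodes.add(ino)
        live_blocks.update(inodes[ino].blocks)
        stack.extend(inodes[ino].children)
    return live_inodes, live_blocks


def sweep(allocated, live, batch=2):
    """Yield leaked resources a few at a time so the caller can pause in between."""
    leaked = sorted(allocated - live)
    for i in range(0, len(leaked), batch):
        yield leaked[i:i + batch]


# Toy file system: inode 2 is the root directory, inode 7 is an orphan
# that is still marked allocated (the kind of leak left after a crash).
inodes = {
    2: Inode(children=[3, 4]),
    3: Inode(blocks=[10, 11]),
    4: Inode(blocks=[12]),
    7: Inode(blocks=[20, 21, 22]),
}
inode_bitmap = {2, 3, 4, 7}
block_bitmap = {10, 11, 12, 20, 21, 22}

live_inodes, live_blocks = mark(inodes, root=2)
leaked_inodes = [i for chunk in sweep(inode_bitmap, live_inodes) for i in chunk]
leaked_blocks = [b for chunk in sweep(block_bitmap, live_blocks) for b in chunk]
print(leaked_inodes, leaked_blocks)
```

The hard part described above is exactly what this sketch leaves out: doing the walk on a mounted, changing file system and writing the reclaimed state back without racing the kernel.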
The other pending issues for FFS2 are making userland tools able to grok it and supporting larger disks. The former means making newfs, fsck, growfs and tunefs able to operate on FFS2 file systems. The latter means bumping daddr_t to 64 bits and having disklabel cope with the changes. This is needed so disk blocks at offsets that don't fit in 32 bits can be read in, kept in the buffer cache and dealt with as 'ordinary' buffers. Getting FFS2 in allows us to do that, as it puts an end to the chicken-and-egg problem concerning daddr_t and FFS, where the first wouldn't be bumped because the second didn't support it, and the second wouldn't be updated because the first hadn't been bumped.
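To put rough numbers on the daddr_t limit, here is a back-of-the-envelope calculation. It assumes 512-byte sectors and a signed block-number type; both are assumptions made for illustration, and the helper name max_bytes is made up.

```python
SECTOR_SIZE = 512   # bytes per disk block; an assumption for this sketch
TIB = 1 << 40       # one tebibyte


def max_bytes(bits, signed=True):
    """Largest disk size addressable with a block number of the given width."""
    usable_bits = bits - 1 if signed else bits
    return (1 << usable_bits) * SECTOR_SIZE


print(max_bytes(32) // TIB)                 # signed 32-bit block numbers
print(max_bytes(32, signed=False) // TIB)   # unsigned 32-bit block numbers
print(max_bytes(64) // TIB)                 # signed 64-bit block numbers
```

Under these assumptions a signed 32-bit block number tops out at 1 TiB, which lines up with the "> 1TB" remark earlier, while a 64-bit one pushes the limit far beyond any disk you can buy.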
Hopefully adding these bits will get some attention, and perhaps attract more people to join me on FS hacking, because that'll be a lot of work.
By smith (66.63.143.34) smith@confuciun.com on
Thanks for all these blog-like entries from the developers. They're really interesting.
Comments
By pedro@ (201.17.60.11) on
By Anonymous Coward (156.34.223.129) on
Comments
By pedro@ (201.17.60.11) on
While it may have been implicit, would you mind explaining exactly why you think that's to be avoided, and how that could be achieved without compromising the file system's integrity?
Comments
By Anonymous Coward (156.34.223.129) on
In the rambling that *might* someday be pertinent to your efforts, I guess I was just expressing my hope that any background 'mark and sweep' garbage collection daemons be intelligent enough to consider power-saving issues when running on battery power on a laptop.
The other issues aren't filesystem related. I should have thought more before I typed so I won't bother to elaborate on them. I'm going to look at my 'data' now. Excuse me...
By Fábio Olivé Leite (200.213.25.66) on
Comments
By pedro@ (201.17.60.11) on
Yes, that's correct.
By Anonymous Coward (66.12.209.212) on
Hasn't McKusick dealt with this with FFS snapshots and the new background fsck he put together for FreeBSD 5?
Comments
By Michael Knudsen (217.157.199.114) on
By Antonios Anastasiadis (213.5.63.117) on
By ike (216.254.76.237) ike@lesmuug.org on http://nycbug.org/
Thanks for posting your thoughts about filesystem design on undeadly.
I don't know where you're at with the design/strategy process you're going through, but I immediately thought of the following book (available at this URL as a PDF document):
"Practical File System Design with the Be File System"
http://www.letterp.com/~dbg/practical-file-system-design.pdf
--
The first 50 or so pages of the PDF cover, historically, a number of contemporary filesystem designs, with their respective goals clearly stated and a frank discussion of the strong and weak points of each design.
The rest of the book is about the BeFs, which may or may not be as relevant. (I personally think it's an exquisite design, though not necessarily relevant with regard to the community needs of OpenBSD, or *BSD for that matter.)
Noteworthy: the author discusses journaling, and alternatives, at length in a high-level (and enjoyable) manner. (It's a fun read if you dig filesystem and database design, man.)
--
Just as an FYI, this seems to be the current place information is collected for 'Bigdisk' stuff on FreeBSD:
http://www.freebsd.org/projects/bigdisk/index.html
Aside from that, good luck!
Best,
.ike
Comments
By Anonymous Coward (68.148.1.194) on
Do you think we don't do the research? Or the searches, or the
background checks? Yes, it is nice to point out (yet again) the
obvious resources, but really, why not just pick up a keyboard
and start coding? :)
Comments
By ike (216.254.76.237) ike@lesmuug.org on http://nycbug.org/
To the 'anonymous coward' who railed at me here:
I was just trying to provide some resources that have been very useful to my thinking when I'm dealing with related persistent data problems. Sorry if that's a bother here.
By Anonymous Coward (82.43.92.127) on
Comments
By Fábio Olivé Leite (200.213.25.66) on
Heh, two weeks ago I saw a friend screw up his notebook's hard disk for the second time because of Linux's ReiserFS (which we ended up calling HellRaiserFS), and then I patted myself on the back again for switching from Linux to OpenBSD three years ago. OpenBSD may lack some "cool" features, but at least I can trust my data to it.
And I speak as the person who actually got ReiserFS into Conectiva Linux (the largest Latin American Linux vendor, now part of Mandriva) six years ago. At that time I had a lot of faith in it. It eventually destroyed all of my faith, with or without journal.
On the other hand, my OpenBSD notebook locked up hard about five or more times last week because of faulty memory (it's going back to the shop tomorrow), and yet even though I was either compiling GENERIC, making build or doing large file transfers when it locked up, I didn't lose any data. Of course I did have some inodes pop up on lost+found once or twice, but I thank softdep for keeping my data safe even then.
Comments
By Anonymous Coward (193.63.217.208) on
Your example of ReiserFS failing tells me that ReiserFS journaling is not to be trusted (poor implementation?), rather than that journaling itself is not to be trusted. If anyone can do journaling *right* and allow us to wave a far-from-fond farewell to fsck after a crash, then I think it is the OpenBSD team.
Comments
By pedro@ (139.82.36.138) on
Comments
By pedro@ (139.82.36.138) on
Or in a sentence: if you want journaling, you tell us why you want it, otherwise we won't be able to refute or evaluate your proposal.
By Anonymous Coward (66.11.66.41) on
No, it is not. It's really tiring seeing this slashdot tardspeak all the time. Journalling does not have magic powers to prevent you from losing data. All journalling does is allow you to replay actions in the log to get the filesystem back into a "consistent state". Consistent just means the filesystem is clean and has no weird inconsistencies like a file with no links not being marked as free. It does nothing to ensure that you don't lose data.
There are other ways to ensure you are in a consistent state besides replaying a journal. And they don't introduce the additional overhead that recording all operations in a journal does. One is fsck, but it takes a long time before you can use your filesystem. Another is bgfsck, which takes even longer, but lets you use the filesystem while it's doing it.
Just because these are the traditional approaches doesn't mean there aren't other approaches that may be better.
Comments
By Anonymous Coward (66.11.66.41) on
By Anonymous Coward (67.64.89.177) on
By Anonymous Coward (70.162.91.58) on
Or is there a larger philosophical reason OBSD chooses not to even touch those filesystems?
Comments
By Anonymous Coward (67.64.89.177) on
Comments
By Anonymous Coward (149.169.255.239) on
Comments
By Anonymous Coward (143.166.255.17) on
By tedu (69.12.168.114) on