Developer Blog: niallo - OpenCVS Status Update and Request for Testing

Contributed by dwc on 2007-01-17 from the Check-This-Out! dept.

Niall O'Higgins (niallo@) writes:

People have been noticing the recent progress made in OpenCVS, and it's long past time I wrote something for undeadly.org. So, what is new and where is OpenCVS now?

I took a three week vacation to visit friends and family back in Ireland over the Christmas period. With much more free time on my hands, I decided to tackle some of the trickier issues which had been blocking us for some time. Namely, support for binary files and rewriting the RCS parser.

Binary file support was one of the major items preventing us from being able to do a full checkout of the OpenBSD 'src' CVS module. The 'src' module is where most of the project commits go, which makes it a great test case for OpenCVS. However, there are a number of binary format files in the repository. Large parts of the RCS API code were written under an incorrect assumption, namely that all data was ASCII, consisting of NUL-terminated C-strings. Functions such as strlen(), strtol(), strlcpy(), etc.., were all being used to deal with binary data, which would cause them to choke. We had done some work on making the RCS parser work with arbitrary binary data at the c2k6 hackathon, and this was used as a starting point.

The basic idea was to switch from using char pointers to hold the data and then dynamically calculating buffer length using strlen(), to u_char pointers and a separate size_t length variable associated with the buffer. Code was changed to consider the new length variable authoritative, and to use that value instead of getting it from a dynamic strlen() call.

This change allowed us to run a full checkout of the 'src' module, with some caveats. Specifically, memory usage was extremely high — so much so that execution would abort under normal login class heap size limits — and we were around one hundred times slower than GNU CVS at checking out certain files. Strangely enough, I didn't consider this acceptable :-) I spent some time with tools like gprof(1) and some code fiddling until I tracked down the cause of this slowness: our RCS API was performing excessive buffer creation and copy operations in revision retrieval. As explained in a previous article, in order to retrieve a particular revision, you typically first retrieve the HEAD revision, then calculate the final data by reverse-applying the diffs in the deltatexts up to the desired revision. This reverse application of diffs was the cause of the slowness.

At the heart of the problem lay a poor choice of data structures. One of my favourite books from the world of Computer Science academia is Data Structures + Algorithms = Programs by Niklaus Wirth (creator of the Pascal language). Central to the book is the doctrine that when writing a program, you should think hard about what data structures to use just as you think hard about the algorithm to use. In fact, you should pretty much choose the data structure before you choose the algorithm. We'd have been wise to follow this advice while writing OpenCVS! In the OpenCVS and OpenRCS codebases, we have a generic data buffer type called BUF which is used all over the place. BUF is not so different from a standard, flat, malloc()'d memory chunk, but it has lots of convenient handler functions. In particular, you can append data to a BUF type and it will automatically realloc() memory and resize itself. Additionally, it is trivial to load a file into a BUF type or write a BUF out to disk.

BUF had been chosen as the data structure to hold revisions, and the rcs_getrev() function would return a BUF with the contents of the requested revision in it to the caller. While this was very convenient — since it makes it trivial to write a revision out to disk, etc. — it was extremely expensive. The BUF data structure is quite far removed from how revisions are actually generated and handled at lower levels. The deltatexts are line-oriented, each line being represented as a node in a linked list. To return the revision as a BUF, the linked list had to be iterated over with each line being appended to the BUF. This wasted memory by having data that was already loaded in the linked list be duplicated in the BUF, and furthermore forced very many slow malloc(), realloc() and memcpy() calls. To add insult to injury, applying RCS patches (to fetch a non-HEAD revision, or a HEAD which was on a branch) would require not only converting linked lists to a BUF, but also the reverse — splitting a BUF back into a linked list. Doing things this way was very inefficient.

Let's step back for a moment. What do we really end up doing with revisions? Looking through the code, it becomes apparent that we mostly just send them to a file descriptor. If this is the case, do we need all this juggling to shove a revision into a BUF type? The answer turned out to be a resounding no. I rewrote the RCS API so that it provided a number of concise functions. rcs_rev_getlines() returns the "native" linked list structure representing the revision. rcs_rev_write_to_fd() calls rcs_rev_getlines() to get the revision, then simply iterates through, writing each line out to the destination file descriptor. I added a couple of convenience functions like rcs_rev_write_stmp() which generates a secure temporary filename and uses rcs_rev_write_fd() to write the revision to it, and also handles putting that temporary file onto the cleanup work list. Finally, I made sure that there was a BUF returning function — only a tiny bit faster than the old, slow implementation, but it would provide backwards compatibility. Backwards compatibility was important because I wanted to keep the initial diff as small as possible. The idea was to switch the simple pieces over to the new API while leaving the more complicated bits for later.

After my initial work, with only 'checkout' fully converted to the new RCS API, I saw approximately one hundred time performance increase in certain areas of the tree. It went from taking over two minutes of CPU time to check out a large, branched binary to taking a second or two on my machine. Unsurprisingly, memory usage was also greatly reduced.

Once this was in-tree, joris@ got all excited and jumped right in, quickly converting the more involved parts of the code, such as commit, update and diff, over to the new RCS API.

The result of all this work is that OpenCVS is quite usable for day-to-day use, in both remote and local mode. Even at this early stage, there are important benefits from the clean, ground-up CVS implementation. Already, there are things OpenCVS can do that GNU CVS cannot. Notably, the XF4 CVS module can be checked out successfully on 64-bit archs. With GNU CVS, this fails due to some bug in that codebase.

Of course, glitches and incompatibilities remain in OpenCVS, and no doubt there are plenty of bugs to discover. For this reason, we'd like to ask the user community to try out OpenCVS and start testing! Send bug reports (please include detailed info on how to reproduce) to niallo@, joris@, ray@ and xsa@. As always, patches are most welcome :-)

Remember that OpenCVS is not yet linked into the build, so you need to check out the current source tree, cd /usr/src/usr.bin/cvs and type make && make install to get the opencvs binary on your system.

(Comments are closed)

Comments

By joris@ (209.59.105.217) on 2007-01-17 17:27

Remote checkout was suppose to be commited before this got
published but I got delayed somewhat, it will be commited either
tonight or tomorrow morning.
By Marc Balmer (213.189.137.178) mbalmer@openbsd.org on 2007-01-17 17:27

Cool to see progress on OpenCVS!
By Paul Irofti (bulibuta) bulibuta@gmail.com on 2007-01-17 17:36

I'll start migrating all my home projects to OpenCVS. Thank you very much and keep up the good work (I know a lot of people just waiting for the first OpenCVS release).
By veins (81.65.210.9) veins@evilkittens.org on 2007-01-17 18:10 http://www.evilkittens.org

Just great, I'll switch my repositories to OpenCVS tomorrow :)
By Anonymous Coward (63.81.240.99) on 2007-01-18 03:35

Thank you for OpenCVS and this write up. It's a very interesting read.
By Frank DENIS (82.224.188.215) fdenis@skyrock.com on 2007-01-18 08:56 http://forum.manucure.info

I just tried to replace GNU CVS by OpenCVS on an OpenBSD mirror server, but an update from a remote client always fails with EBUSY:

$ opencvs -z3 update -Pd
U dev/usb/if_axe.c
U dev/usb/usbdevs
U dev/usb/usbdevs.h
U dev/usb/usbdevs_data.h
cvs server: .: Device busy
Comments
1. By joris@ (208.158.5.163) on 2007-01-18 13:29
  
  > I just tried to replace GNU CVS by OpenCVS on an OpenBSD mirror server, but an update from a remote client always fails with EBUSY:
  >
  > $ opencvs -z3 update -Pd
  > U dev/usb/if_axe.c
  > U dev/usb/usbdevs
  > U dev/usb/usbdevs.h
  > U dev/usb/usbdevs_data.h
  > cvs server: .: Device busy
  >
  >
  
  send a bug report and include a trace (-t) and a little
  bit about the setup of the repository. because the above
  is just not enough for us to understand the problem.
  Comments
  1. By Anonymous Coward (82.224.188.215) on 2007-01-18 22:28
    
    > send a bug report and include a trace (-t) and a little
    > bit about the setup of the repository. because the above
    > is just not enough for us to understand the problem.
    
    I just sent you a mail with a trace.
    
    Thanks again.
By JFJ (212.182.106.114) on 2007-01-18 10:09 http://fred.gnook.org

Just Great! OpenCVS is what a lot of ppl were waiting for. Thanks a lot.
--
JFJ
By Magnus Holmberg (mho) undeadly@mho.nu on 2007-01-18 15:10

"Notably, the XF4 CVS module can be checked out successfully on 64-bit archs. With GNU CVS, this fails due to some bug in that codebase."

Actually, this "bug" is due to to the dlimit in OpenBSD's CVSROOT being to small:-) If you increase dlimit in CVSROOT/options and CVSROOT/config (don't remember off hand which one is the problem), even the patologial XF4/xc/fonts/bdf/misc/18x18ja.bdf checks out fine. (479235 lines makes for a lot of large 64bit pointer arrays in GNU CVS implementation...)

- mho
Comments
1. By niallo (82.195.149.9) niallo@openbsd.org on 2007-01-18 19:22
  
  > "Notably, the XF4 CVS module can be checked out successfully on 64-bit archs. With GNU CVS, this fails due to some bug in that codebase."
  >
  > Actually, this "bug" is due to to the dlimit in OpenBSD's CVSROOT being to small:-) If you increase dlimit in CVSROOT/options and CVSROOT/config (don't remember off hand which one is the problem), even the patologial XF4/xc/fonts/bdf/misc/18x18ja.bdf checks out fine. (479235 lines makes for a lot of large 64bit pointer arrays in GNU CVS implementation...)
  >
  > - mho
  >
  
  Your comment is interesting. However, I disagree with your implication that this is not a bug. An application like CVS should be able to operate within the default memory limits of OpenBSD. OpenCVS can do it, afterall. As far as I'm concerned, this is indeed a bug - according the usual sense of the word - in the the GNU implementation. As I see things, you shouldn't need to go fiddle with some weird file under the CVSROOT to have things check out properly. It should "just work". Not even Firefox requires fiddling to run under OpenBSD
  
  I am surprised that people would make excuses for poor memory usage in an application of this nature. Perhaps this is an example of differences in BSD and GNU software development philosophies?
  Comments
  1. By Steve Shockley (68.80.137.106) on 2007-01-19 03:19
    
    > I am surprised that people would make excuses for poor memory usage in an application of this nature. Perhaps this is an example of differences in BSD and GNU software development philosophies?
    
    Perhaps "workaround" would be a better term...
  2. By Magnus Holmberg (mho) on 2007-01-20 00:02
    
    > Your comment is interesting. However, I disagree with your implication that this is not a bug. An application like CVS should be able to operate within the default memory limits of OpenBSD.
    
    It's not an OpenBSD default memory limit; it's an artificial limit put on the CVS repo - most likely to prevent cvs.openbsd.org/anoncvs servers from overload. The dlimit keyword will cause GNU CVS to ulimit itself on startup (which I suspect OpenCVS doesn't do). The OS limits are more than enough unless fiddled with.
    
    > I am surprised that people would make excuses for poor memory usage in an application of this nature. Perhaps this is an example of differences in BSD and GNU software development philosophies?
    
    Not so much making excuses as relating painfully hard-won experience from back when I was running anoncvs.se.openbsd.org on an Ultra-1... A couple of hours running CVS inside gdb and reading (rather unreadable) CVS code, until I found out it was a configuration issue that made it run out of memory.... People unlucky enough to be near me when I figured it out probably still have hearing trouble, 4 years on:-/
    
    - mho
    
    Comments
    
    By Magnus Holmberg (mho) on 2007-01-20 01:37
    
    A The dlimit keyword will cause GNU CVS to ulimit itself on startup (which I suspect OpenCVS doesn't do).
    
    Open foot, insert mouth! I should read code before I post:-) I see that OpenCVS does, indeed, support the dlimit keyword.
    
    The default 48MB dlimit seems to be enough to check out XF4 (using GNU CVS), nowadays, on both amd64 and sparc64, though. I also tried a 3.6/sparc64; it bombed out unless I increased dlimit to 64MB...
    
    (And you are probably right; in GNU Universe, memory is endless, so it is not a problem to realloc() two (or was it three?) arrays of pointers to half-a-million line-structs:-))
    
    - mho
    
    Comments
    
    By Anonymous Coward (87.79.237.121) on 2007-01-20 07:35
    
    > A The dlimit keyword will cause GNU CVS to ulimit itself on startup
    
    Interestingly enough, GNU CVS 1.12 does not support
    this keyword. But thanks, this finally clears up,
    after years, why I had trouble with the 18x18 fonts
    (and some others - possibly memory corruption _after_
    dealing with them) some time ago. The 18x18ko one
    is even larger, by the way.
By Igor Sobrado (156.35.192.4) on 2007-01-19 19:01

Nice work!

Hope that binary diffs will be supported in the future. I had a chance to test binary support in OpenRCS (not OpenCVS) some weeks ago, and binary patches would be great for distributing fixes to the operating system. Don't know if it will be possible, however, as binary support should be extended to other tools like patch, diff and diff3.

I am glad to see that these superb flavours of RCS and cvs are being developed for OpenBSD. These tools will be an excellent replacement for GNU counterparts.

Cheers!
Igor.

Latest Articles

Fri, Apr 26
- 08:33 Passphrase timeout for disk decryption at boot added (potential battery lifesaver) (2)
Wed, Apr 24
- 04:35 Game of Trees 0.98 released (2)
Tue, Apr 23
- 05:25 pfctl(8) and systat(8) to display fragment reassembly statistics (0)
Thu, Apr 18
- 05:05 Coming soon to a -current system near you: parallel raw IP input (0)
Wed, Apr 17
- 05:33 In -current, default write format for tar(1) changed to "pax" (15)
Wed, Apr 10
- 18:50 OpenSMTPD 7.5.0p0 Released (2)
Tue, Apr 09
- 04:49 20 years since "and we're just starting": undeadly.org turns 20 (2024-04-09) (13)
Fri, Apr 05
- 06:16 OpenBSD 7.5 released (3)
Thu, Mar 28
- 18:18 LibreSSL 3.8.4 and 3.9.1 released (0)

Credits

Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]