Contributed by dwc on from the Check-This-Out! dept.
People have been noticing the recent progress made in OpenCVS, and it's long past time I wrote something for undeadly.org. So, what is new and where is OpenCVS now?
I took a three-week vacation to visit friends and family back in Ireland over the Christmas period. With much more free time on my hands, I decided to tackle some of the trickier issues which had been blocking us for some time: namely, binary file support and a rewrite of the RCS parser.
Binary file support was one of the major items preventing us from doing a full checkout of the OpenBSD 'src' CVS module. The 'src' module is where most of the project's commits go, which makes it a great test case for OpenCVS. However, the repository contains a number of binary files. Large parts of the RCS API code were written under the incorrect assumption that all data was ASCII, consisting of NUL-terminated C strings. Functions such as strlen(), strtol() and strlcpy() were being used on binary data, and they choke on embedded NUL bytes. We had done some work on making the RCS parser handle arbitrary binary data at the c2k6 hackathon, and this was used as a starting point.
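To see why string routines are the wrong tool here, consider this tiny illustration (the function name is made up for the example; only the failure mode is real). Any string function stops at the first 0x00 byte, so on binary data it silently reports a truncated length:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical illustration of the bug class described above: treating
 * binary data as a NUL-terminated C string makes every string routine
 * stop at the first 0x00 byte, truncating the real buffer.
 */
size_t
strlen_view(const unsigned char *data)
{
	/* What the old code effectively computed: bytes before the first NUL. */
	return strlen((const char *)data);
}
```

For an 8-byte buffer such as { 'G', 'I', 'F', '8', '\0', 'a', 0x01, 0x02 }, strlen_view() returns 4, dropping half the data.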
The basic idea was to switch from char pointers holding the data, with buffer lengths dynamically calculated via strlen(), to u_char pointers paired with a separate size_t length variable associated with the buffer. Code was changed to treat the new length variable as authoritative, and to use that value instead of recomputing it with strlen() calls.
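The new representation can be sketched like this (the struct and function names are illustrative, not the actual OpenCVS identifiers). Because the length travels with the data and is treated as authoritative, embedded NUL bytes are harmless:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the fix: raw bytes plus an explicit, authoritative length.
 * Names here are made up for illustration.
 */
struct rdata {
	unsigned char	*rd_buf;	/* raw bytes, possibly containing NULs */
	size_t		 rd_len;	/* authoritative length */
};

/* Copy len bytes into a fresh rdata; no strlen() anywhere. */
int
rdata_set(struct rdata *rd, const unsigned char *src, size_t len)
{
	rd->rd_buf = malloc(len);
	if (rd->rd_buf == NULL)
		return (-1);
	memcpy(rd->rd_buf, src, len);
	rd->rd_len = len;
	return (0);
}
```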
This change allowed us to run a full checkout of the 'src' module, with some caveats. Specifically, memory usage was extremely high — so much so that execution would abort under normal login class heap size limits — and we were around one hundred times slower than GNU CVS at checking out certain files. Strangely enough, I didn't consider this acceptable :-) I spent some time with tools like gprof(1) and some code fiddling until I tracked down the cause: our RCS API was performing excessive buffer creation and copy operations during revision retrieval. As explained in a previous article, in order to retrieve a particular revision, you typically first retrieve the HEAD revision, then calculate the final data by reverse-applying the diffs in the deltatexts up to the desired revision. This reverse application of diffs was where all the time was going.
At the heart of the problem lay a poor choice of data structures. One of my favourite books from the world of Computer Science academia is Algorithms + Data Structures = Programs by Niklaus Wirth (creator of the Pascal language). Central to the book is the doctrine that when writing a program, you should think just as hard about which data structures to use as about which algorithm to use. In fact, you should pretty much choose the data structure before you choose the algorithm. We'd have been wise to follow this advice while writing OpenCVS! In the OpenCVS and OpenRCS codebases, we have a generic data buffer type called BUF which is used all over the place. BUF is not so different from a standard, flat, malloc()'d memory chunk, but it has lots of convenient handler functions. In particular, you can append data to a BUF and it will automatically realloc() memory and resize itself. Additionally, it is trivial to load a file into a BUF or write a BUF out to disk.
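A minimal sketch of such a growable buffer might look like this (simplified, with invented names; the real BUF API differs). Note that every append can trigger a realloc() and a copy, which is exactly the traffic that mounted up:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of a BUF-like growable byte buffer (names invented). */
typedef struct {
	unsigned char	*cb_buf;	/* backing storage */
	size_t		 cb_size;	/* allocated bytes */
	size_t		 cb_len;	/* bytes in use */
} SBUF;

/* Append len bytes, doubling the allocation as needed. */
int
sbuf_append(SBUF *b, const void *data, size_t len)
{
	while (b->cb_len + len > b->cb_size) {
		size_t nsize = b->cb_size ? b->cb_size * 2 : 64;
		unsigned char *np = realloc(b->cb_buf, nsize);
		if (np == NULL)
			return (-1);
		b->cb_buf = np;
		b->cb_size = nsize;
	}
	memcpy(b->cb_buf + b->cb_len, data, len);
	b->cb_len += len;
	return (0);
}
```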
BUF had been chosen as the data structure to hold revisions, and the rcs_getrev() function would return a BUF containing the contents of the requested revision to the caller. While this was very convenient — since it makes it trivial to write a revision out to disk, etc. — it was extremely expensive. The BUF data structure is quite far removed from how revisions are actually generated and handled at lower levels. The deltatexts are line-oriented, each line being represented as a node in a linked list. To return the revision as a BUF, the linked list had to be iterated over with each line appended to the BUF. This wasted memory by duplicating data already loaded in the linked list, and forced a great many slow malloc(), realloc() and memcpy() calls. To add insult to injury, applying RCS patches (to fetch a non-HEAD revision, or a HEAD which was on a branch) required not only converting linked lists to a BUF, but also the reverse — splitting a BUF back into a linked list.
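The expensive path can be shown in miniature (again with invented, simplified names): flattening the line list into one contiguous buffer duplicates every byte that is already in memory, on top of the allocation and copy traffic needed to get there.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative line node: one revision line in the linked list. */
struct rline {
	const unsigned char	*l_data;
	size_t			 l_len;
	struct rline		*l_next;
};

/*
 * Flatten the line list into a single malloc()'d buffer -- every byte
 * now exists twice in memory, once in the list and once in the buffer.
 */
unsigned char *
lines_to_flat(const struct rline *head, size_t *lenp)
{
	const struct rline *l;
	unsigned char *buf, *p;
	size_t total = 0;

	for (l = head; l != NULL; l = l->l_next)
		total += l->l_len;
	if ((buf = malloc(total)) == NULL)
		return (NULL);
	for (p = buf, l = head; l != NULL; l = l->l_next) {
		memcpy(p, l->l_data, l->l_len);	/* second copy of each line */
		p += l->l_len;
	}
	*lenp = total;
	return (buf);
}
```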
Let's step back for a moment. What do we really end up doing with revisions? Looking through the code, it becomes apparent that we mostly just send them to a file descriptor. If this is the case, do we need all this juggling to shove a revision into a BUF? The answer turned out to be a resounding no. I rewrote the RCS API so that it provided a number of concise functions. rcs_rev_getlines() returns the "native" linked list structure representing the revision. rcs_rev_write_fd() calls rcs_rev_getlines() to get the revision, then simply iterates through it, writing each line out to the destination file descriptor. I added a couple of convenience functions like rcs_rev_write_stmp(), which generates a secure temporary filename, uses rcs_rev_write_fd() to write the revision to it, and also handles putting that temporary file onto the cleanup work list. Finally, I made sure that there was a BUF-returning function — only a tiny bit faster than the old, slow implementation, but it would provide backwards compatibility. Backwards compatibility was important because I wanted to keep the initial diff as small as possible. The idea was to switch the simple pieces over to the new API while leaving the more complicated bits for later.
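The write-straight-to-fd approach can be sketched as follows (illustrative names and signatures, not the real OpenCVS ones): the revision goes from its native line list directly to the descriptor, with no intermediate flat buffer and no extra copies.

```c
#include <unistd.h>

/* Illustrative line node: one revision line in the linked list. */
struct rline {
	const unsigned char	*l_data;
	size_t			 l_len;
	struct rline		*l_next;
};

/*
 * Write a revision straight from its native line list to a file
 * descriptor, handling short writes.  No flattening step needed.
 */
int
lines_write_fd(const struct rline *head, int fd)
{
	const struct rline *l;

	for (l = head; l != NULL; l = l->l_next) {
		size_t off = 0;
		while (off < l->l_len) {
			ssize_t nw = write(fd, l->l_data + off,
			    l->l_len - off);
			if (nw == -1)
				return (-1);	/* caller inspects errno */
			off += (size_t)nw;
		}
	}
	return (0);
}
```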
After my initial work, with only 'checkout' fully converted to the new RCS API, I saw a roughly hundredfold performance increase in certain areas of the tree. Checking out a large, branched binary file went from taking over two minutes of CPU time to a second or two on my machine. Unsurprisingly, memory usage was also greatly reduced.
Once this was in-tree, joris@ got all excited and jumped right in, quickly converting the more involved parts of the code, such as commit, update and diff, over to the new RCS API.
The result of all this work is that OpenCVS is quite usable for day-to-day work, in both remote and local mode. Even at this early stage, there are important benefits from the clean, ground-up CVS implementation. Already, there are things OpenCVS can do that GNU CVS cannot. Notably, the XF4 CVS module can be checked out successfully on 64-bit archs, where GNU CVS fails due to a bug in its codebase.
Of course, glitches and incompatibilities remain in OpenCVS, and no doubt there are plenty of bugs to discover. For this reason, we'd like to ask the user community to try out OpenCVS and start testing! Send bug reports (please include detailed info on how to reproduce) to niallo@, joris@, ray@ and xsa@. As always, patches are most welcome :-)
Remember that OpenCVS is not yet linked into the build, so you need to check out the current source tree, then cd /usr/src/usr.bin/cvs and run make && make install to get the opencvs binary on your system.