OpenBSD Journal

docbook2mdoc-1.0.0 released

Contributed by Ingo Schwarze from the DocBook-considered-woeful dept.

After doing active development on it for about a month, i just released version 1.0.0 of the DocBook to mdoc converter, docbook2mdoc(1). The OpenBSD port was updated, too. In a nutshell, docbook2mdoc was brought from experimental status to an early release that can be considered mostly usable for production, though no doubt there are still many rough edges. That's why i called it 1.0.0 and not 1.1.1.

Lots of features were added including support for many new DocBook XML elements and for two kinds of file inclusion, formatting was improved in many respects, and several reorganizations were done with respect to internal code structure. The expat library is no longer needed, and no other dependency is required.

See its homepage for all information about the utility and the release notes for details about this release.

Thanks to Stephen Gregoratto for a number of patches and many useful reports.

The rest of this article explains some important design and implementation decisions and mentions some use cases.

About the Oasis DocBook XML language

DocBook intends to provide authors of technical documentation, in particular about computer software, with a system of semantic markup. So far, that's an excellent idea. It's exactly the same goal which the mdoc(7) language pursues, too.

One difference between the two is that the DocBook language is much larger than the mdoc language — more than five times the size for basically the same task. While mdoc provides about 70 non-deprecated macros, DocBook version 5.1 defines over 400 elements. While some of them provide opportunities to mark up the meaning of words in more detail and some additional structuring elements can arguably be useful, in particular for larger documents, the bulk of the larger size is a downside rather than an advantage: Most elements are almost useless, giving excessive detail about the meaning of words that doesn't provide any real benefit but merely hinders the readability of the source code and causes pointless additional markup work for authors. Besides, the language is extremely redundant: almost everything can be expressed correctly in several different ways, making the language harder to learn, making document source code harder to read, and making documents look less uniform.

All the same, focussing our attention on elements that are crucial to software documentation, it turns out the DocBook language is not more powerful than mdoc, or at least not unambiguously so. In fact, there are a number of very useful mdoc macros that lack an equivalent in DocBook. For example, there are no good ways to represent markup information as fundamental as .Cm, .Ic, .In; equivalents of some more rarely used macros like .Cd are also missing. One might have hoped that a language five times the size would at least provide a proper superset of the crucial features, in particular since mdoc predates DocBook by a number of years, so the language designers could easily have pilfered ideas from mdoc, and even more so from the substantial corpus of mdoc documents that already existed by the time DocBook was invented.

So, bloated, redundant, and incomplete at the same time — that combination is no doubt a landmark symptom of a thoroughly botched design. But i didn't even get to the worst defects of DocBook yet. The crucial downsides are that it is underspecified, inherently non-portable, ill-designed with respect to many details, and lacks any kind of overall design or architecture. It's just a giant heap of unstructured, erratic bits and pieces. Let's look at some examples.

It is underspecified.

Even though there are more than 400 DocBook elements and even though with such a huge number, great expressive power could no doubt be achieved with great precision, large numbers of elements only vaguely specify their syntax and semantics. That even applies to some of the most important elements of all. For example, the language specification explicitly leaves it unspecified whether for a command line option, authors should write <option>a</option> or <option>-a</option>, and in the former case, whether the formatter should render <option>a</option> as "a" or as "-a". All the specification says is that some do it this way, some another. Of course it is blatantly obvious what a catastrophic design error it is to leave such a fundamental aspect unspecified, and i have of course encountered cases in real-world documents where it causes misformatting. That kind of thoughtlessness is not at all unusual in DocBook: many elements suffer from similar explicitly stated stupidity in the specification. Besides, there are many cases where the language in the specification is just vague and it remains unclear how the element is to be used and formatted, the result of course being that different real-world documents use the element in conflicting ways, expecting different formatting. In practice, the mess gets even worse. Given the vast size of the language, many authors simply fail to find the elements that would best suit their needs, instead grabbing whatever they happen to run into first.

It is inherently non-portable.

People will consider it obvious that when others give them a manual page written on one system and they format it with their own formatter on a different system, it will of course come out just right. Not so with DocBook. DocBook 4.5 documents will almost certainly misformat, or even abort formatting with fatal parsing errors, when fed to a DocBook 5.1 formatter, even on the same system. And even when transferring a DocBook document of a specific version, say 5.1, from one system to another, it is unlikely to come out right, given the above point of how DocBook is underspecified.

It is ill-designed with respect to many details.

Let's take the crucial funcdef element as a typical example. Here is the example from the specification showing how it is intended to be used:

 <funcdef>int <function>max</function></funcdef>
 

What is wrong here?

  1. The "funcdef" element has text content. But that text content is not the function definition. It isn't even the function name. No, counter-intuitively, it is the function return type.
  2. Even though an element <type> exists to mark up type names (for example the "int" from the C language), that element is not supposed to be used here.
  3. The name "funcdef" is an obvious misnomer. This isn't a definition at all. At best, it's (part of) a declaration.
  4. The name "function" is also badly chosen. Given that elements "funcdef", "funcparams", "funcprototype", "funcsynopsis" exist, too, it should be something like "funcname".
  5. Why would you ever want to give a function name together with the return type, but without a parameter list? I can hardly imagine any situation where that would make sense. Yet, if you want to add the parameter list, you have to append another element, <paramdef>, after the </funcdef>, and wrap it all in yet another element, <funcprototype>.

In the end, you need four elements named quite confusingly and totalling over 90 bytes of overhead, for what mdoc does simply with ".Fo .Fa". Remember this is just one example out of more than 400 elements, so there are many such cases. Of course, it is impossible to summarize such low-level misdesign in a general way, and i won't bore you with more examples.
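The figure of "over 90 bytes" can be checked by counting nothing but the opening and closing tags of the four elements named above. A minimal sketch in Python; the helper name tag_overhead is made up for this illustration and is not part of docbook2mdoc:

```python
# Count the bytes spent on opening and closing tags for the four
# elements needed to declare one function prototype in DocBook.
ELEMENTS = ["funcprototype", "funcdef", "function", "paramdef"]


def tag_overhead(names):
    """Return the total length of <name> plus </name> for each name."""
    # "<name>" costs len(name) + 2 bytes, "</name>" costs len(name) + 3.
    return sum(2 * len(name) + 5 for name in names)


print(tag_overhead(ELEMENTS))  # prints 92
```

That is 92 bytes of pure tag overhead before any content is written, and before adding a <parameter> element inside the <paramdef>.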

Lack of any consistent design or architecture (1).

Many elements are purely semantic, often leaving formatting unspecified. Many elements are purely presentational, leaving semantics unspecified. Many are a mixture of both. It is hard to predict what you might encounter in any specific subject area; usually, some of each kind. Some elements render their content serially, some reorder it, some are expected not to show all or part of it at all. Many elements take crucial content as attributes rather than child elements; i can't see any system in what is done with attributes and what with child elements. It feels totally random.

Lack of any consistent design or architecture (2).

As usual with XML, a significant part of the specification is devoted to the question of which elements can nest into which other elements, and as usual with XML, there are lots of syntactic restrictions in this respect. But which restrictions apply seems mostly arbitrary. Often, nestings are explicitly allowed that make no sense whatsoever. For example, you are allowed to nest a <group> of arguments into a single <arg>. For each element, there is an explicit list of elements allowed inside, and for most elements, this list contains dozens of entries. Multiplied by 400 elements, that's certainly well above ten thousand nesting rules. No human document author can possibly learn those rules, even less so because they feel so arbitrary and unsystematic. You have no choice but to constantly look them up.

In fact, the devastating quality of DocBook does not come as a surprise. It is essentially a design by committee product steered by big corporations: O'Reilly, AT&T, Sun, Novell, DEC, Fujitsu, HP, Hitachi and many others. Bloat, redundancy, lack of attention to detail, portability nightmares, and lack of design and architecture are the natural and expected outcomes of such a situation. I see no indication that any of the people who had previously done successful work on related topics (for example Doug McIlroy, Ken Thompson, Dennis Ritchie, Brian Kernighan, James Clark, or Cynthia Livingston; Joe Ossanna had unfortunately already died by that time) were even involved.

In a nutshell, it is corporate bloatware unfit for the world of free and open software.

On top of that, the standard toolchain for converting DocBook input to man(7) output is notorious for being exceedingly slow and producing man(7) code of extremely low quality — which is of course not the fault of the DocBook language, but rather of the very poorly developed and maintained formatting software. Even though that formatting toolchain could certainly be improved (if anybody were interested in working on it), the woeful state of the default formatter certainly curtails the usefulness of the already very bad language even further.

So, the executive summary about DocBook is pretty simple: never use it for anything.

However, given the substantial amounts of text that exist in the poor language, a tool is needed to convert that text to a better format. Enter docbook2mdoc, which Kristaps@ dragged onto the stage five years ago. And just now, i gave it a good fluff-up.

Ditching libexpat

You have probably heard people say that you should not parse XML or HTML by hand but instead use an existing parser library, and in general, that is good advice — not re-inventing the wheel often makes sense. Besides, HTML is often parsed off the wire and hence might be hostile, while parsers are prone to bugs, so using a well-tested parser library is usually a good idea from a security perspective, too.

Also, using a validating parser rather than just a parser is often a good idea. In general, validate your input and if it's invalid, error out rather than plodding on, or you will likely end up doing undefined and potentially insecure processing. In that sense, the expat library provides a strongly validating parser: in case of many classes of invalid input, it errors out. There is no option to recover and continue parsing.

However, for an input format that is as crappy as DocBook in the first place, we couldn't care less whether the input is valid or not. Also, erroring out is about as hostile towards the user as we could possibly be: the poor guy already knows the input is in an undesirable format, that's the whole point of running docbook2mdoc in the first place. They just want to get the damn text out of it no matter what. So why should we ever throw up on them? Even the worst kinds of violations of XML well-formedness should not hinder recovery of the text.

So, generalities may not apply to specific situations, and expat was out.

I might have switched to libxml2, or maybe some other parser library that supports parsing invalid input. However, libxml2 requires iconv, and adding such a heavy dependency for a task as simple as basic XML parsing seemed excessive. Also, the task here really isn't to do standards-conforming sophisticated XML parsing with all the bells and whistles. The task is merely to extract the text and some element and attribute names from very simple input. There is actually value in ignoring excessively complex constructs in the input. So i wrote the XML parser from scratch, in ISO C. It's now 180 lines of code grand total (function parse_file() in file parse.c). That probably took less time than figuring out how to write the glue code to integrate libxml2 or a similar library, and it's certainly simpler and more flexible. Needless to say, don't do that when you write your next web browser.
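To give an idea of how little is needed for this kind of parsing, here is a minimal sketch of a parser that extracts text, element names and attributes and never errors out. It is written in Python rather than ISO C, it is not the code of parse_file() in parse.c, and the function name parse_lenient and the event tuples are assumptions made for this illustration. It deliberately ignores entities, doctype internals and other complex constructs.

```python
import re

# Matches one attribute of the form name="value" or name='value'.
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*("[^"]*"|\'[^\']*\')')


def parse_lenient(src):
    """Yield ("open", name, attrs), ("close", name) and ("text", str)
    events.  Never raises on malformed input: whatever cannot be
    understood as a tag is passed through as text."""
    pos = 0
    while pos < len(src):
        lt = src.find("<", pos)
        if lt == -1:
            yield ("text", src[pos:])
            return
        if lt > pos:
            yield ("text", src[pos:lt])
        if src.startswith("<!--", lt):
            # Skip comments; an unterminated comment eats the rest.
            end = src.find("-->", lt + 4)
            pos = len(src) if end == -1 else end + 3
            continue
        gt = src.find(">", lt)
        if gt == -1:
            # Unterminated tag: keep the rest as plain text.
            yield ("text", src[lt:])
            return
        body = src[lt + 1:gt]
        pos = gt + 1
        if body.startswith(("?", "!")):
            continue  # processing instruction or doctype: ignored
        if body.startswith("/"):
            yield ("close", body[1:].strip())
            continue
        selfclose = body.endswith("/")
        if selfclose:
            body = body[:-1]
        parts = body.split(None, 1)
        if not parts:
            yield ("text", src[lt:gt + 1])  # something like "<>"
            continue
        name = parts[0]
        attrs = {}
        if len(parts) > 1:
            for m in ATTR_RE.finditer(parts[1]):
                attrs[m.group(1)] = m.group(2)[1:-1]
        yield ("open", name, attrs)
        if selfclose:
            yield ("close", name)


# Badly nested and truncated input still yields all of its text.
for event in parse_lenient("<para>See <b>more</para> <oops"):
    print(event)
```

Note that the sketch does not check nesting at all: the unclosed <b> and the truncated tag at the end are simply passed along, which is the point.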

Ditching the validator

A very significant fraction of the source code of docbook2mdoc used to be dedicated to the validation of XML element nesting. Even worse than having all that code around: almost all the work required for adding a new element to the parser had to be spent writing new nesting validation code. That work was slow and very tedious, slowed down development almost to a grinding halt, and all i learnt from that work is that the nesting rules make no sense whatsoever. And the only "benefit" from all that painfully written code was that the program would often refuse to render anything at all because it ran into an element it didn't know, or alternatively refuse its work and start an argument with me, even though i never wrote that XML code in that particular input file, that some elements ought to be nested differently. Grrr…

So i just threw out the nesting validation code completely. Before that change, most documents in the Xenocara tree would simply abort and fail to produce any output whatsoever. After the change, they all started producing more or less decent output.

Validation, debugging, and statistics

As explained above, it makes no sense to ask whether the input is valid or invalid from the DocBook perspective. However, it does make sense to ask how well docbook2mdoc can already handle it, because that information is needed to decide how the program can be improved. So errors are still thrown when docbook2mdoc encounters elements it doesn't know yet; these errors just aren't fatal any more. Also, a few serious kinds of malformed input are still reported: duplicate doctypes, more than one top-level element in a file, plain text before or after it, mismatching opening and closing tags, and so on. Again, none of those are fatal. For debugging purposes, docbook2mdoc now has a -T lint mode and a -T tree dumping mode.
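The idea of reporting problems without ever giving up can be sketched as follows. This is an illustration in Python, not the docbook2mdoc implementation: the event tuples, the function name lint, the message wording and the tiny set of "known" elements are all hypothetical.

```python
KNOWN = {"refentry", "para", "option", "emphasis"}  # hypothetical subset


def lint(events, known=KNOWN):
    """Return a list of warning strings for a sequence of
    ("open", name), ("close", name) and ("text", str) events.
    Nothing here is fatal: processing always runs to the end."""
    warnings = []
    stack = []      # names of the elements that are currently open
    toplevel = 0    # number of top-level elements seen so far
    for ev in events:
        kind = ev[0]
        if kind == "open":
            name = ev[1]
            if name not in known:
                warnings.append(f"unknown element <{name}>")
            if not stack:
                toplevel += 1
                if toplevel > 1:
                    warnings.append(
                        f"more than one top-level element: <{name}>")
            stack.append(name)
        elif kind == "close":
            name = ev[1]
            if name in stack:
                # Implicitly close everything opened after <name>.
                while stack[-1] != name:
                    inner = stack.pop()
                    warnings.append(f"<{inner}> closed by </{name}>")
                stack.pop()
            else:
                warnings.append(f"unmatched </{name}>")
        elif kind == "text" and not stack and ev[1].strip():
            warnings.append("text outside the top-level element")
    for name in reversed(stack):
        warnings.append(f"<{name}> never closed")
    return warnings
```

A caller would print these messages and then format the document anyway.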

I also wrote a small, stand-alone tool to count frequencies of elements and parent-child relations in DocBook corpora, to help decisions of what to implement next. That tool is only available in the CVS version. There is no need to put it into releases because people working on the code will work from CVS anyway, not from a release tarball.
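Counting of this kind needs very little code. The following Python sketch shows the general idea only; it is not the tool from CVS, and the names TAG_RE and element_stats are invented for the illustration. It assumes reasonably well-nested input, so the parent-child counts are only approximate for broken files.

```python
import re
from collections import Counter

# Opening, closing, or self-closing tag; comments, doctypes and
# processing instructions do not match because of the leading letter.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")


def element_stats(text):
    """Return (elements, pairs): how often each element occurs and
    how often each (parent, child) combination occurs."""
    elements = Counter()
    pairs = Counter()
    stack = []
    for closing, name, selfclosing in TAG_RE.findall(text):
        if closing:
            if stack:
                stack.pop()
            continue
        elements[name] += 1
        if stack:
            pairs[(stack[-1], name)] += 1
        if not selfclosing:
            stack.append(name)
    return elements, pairs


sample = ("<refentry><para>a <option>-x</option> b "
          "<option>-y</option></para><para/></refentry>")
elements, pairs = element_stats(sample)
for name, count in elements.most_common():
    print(count, name)
for (parent, child), count in pairs.most_common():
    print(count, parent, child)
```

Run over a whole corpus, the most frequent elements and pairs that are not yet handled are the obvious candidates for what to implement next.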

Structural improvements

Sections and subsections

Sectioning elements are among the most redundant parts of DocBook. There are at least three different schemes: Books can have parts and chapters. Sections can use elements with or without explicit level numbers. And refentries, the DocBook equivalent of manual pages, have their own sectioning elements. When you see a <chapter> or <section>, you have no idea whether it is nested inside other sectioning elements, and if so, how deep. The docbook2mdoc utility now ignores all these pointless variations, internally mapping all sectioning elements to <section> and simply counting the nesting level itself. The top level is translated to .Sh, the second level to .Ss. KISS.
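The mapping can be pictured with a few lines of Python. This is a sketch under assumptions, not the actual code: the set of element names is only a sample of DocBook's sectioning elements, the function name is made up, and since the text above only covers the first two levels, deeper levels are left without a macro (None) here.

```python
# A sample of DocBook sectioning elements; the real list is longer.
SECTIONING = {"part", "chapter", "appendix", "section", "simplesect",
              "sect1", "sect2", "sect3", "refsect1", "refsect2"}

MACRO_BY_DEPTH = {1: ".Sh", 2: ".Ss"}


def section_levels(events):
    """For each opened sectioning element in a sequence of
    ("open", name) and ("close", name) events, return a tuple
    (name, depth, macro).  The depth is counted from the actual
    nesting; level numbers in element names are ignored."""
    depth = 0
    result = []
    for kind, name in events:
        if name not in SECTIONING:
            continue
        if kind == "open":
            depth += 1
            result.append((name, depth, MACRO_BY_DEPTH.get(depth)))
        elif kind == "close" and depth > 0:
            depth -= 1
    return result


events = [("open", "chapter"), ("open", "sect1"), ("close", "sect1"),
          ("open", "section"), ("close", "section"), ("close", "chapter"),
          ("open", "refsect1"), ("close", "refsect1")]
for name, depth, macro in section_levels(events):
    print(name, depth, macro)
```

Here <chapter> and <refsect1> both end up as .Sh, while <sect1> and the nested <section> both end up as .Ss, regardless of what they are called.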

Paragraph handling

The way paragraphs work is completely different in DocBook and mdoc. In DocBook, they are elements containing text. In mdoc, only paragraph breaks are marked (with .Pp), and the paragraph macro is always empty. Some DocBook elements ignore paragraph elements immediately inside them, for example <entry> and <footnote>. Some mdoc macros imply paragraph breaks before themselves (e.g. .Bd and .Bl) or before and after (e.g. .Sh and .Ss). The formatter now handles this with a state variable. Some macros (like .Sh) block setting the state variable right afterwards. Some elements (in particular <para>) set the state variable unless it is blocked, requesting a break. Printing some macros (like .Bl) clears the state variable. Printing normal text or macros when the state variable is set results in emitting .Pp and clearing the state. To support mechanisms such as this one, nodes are now classified according to the kinds of mdoc macros they emit.
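The state variable described above can be modelled in a few lines. The following Python sketch is an illustration only: the class, its method names and the three state names are hypothetical, and it assumes that printing anything lifts a block set by .Sh.

```python
class Formatter:
    """Toy model of the paragraph-break state variable."""

    NONE, WANTED, BLOCKED = range(3)  # hypothetical state names

    def __init__(self):
        self.par = self.NONE
        self.out = []

    def section(self, title):
        # .Sh implies breaks before and after itself, so a following
        # <para> must not request another one.
        self.out.append(f".Sh {title}")
        self.par = self.BLOCKED

    def para(self):
        # <para> only requests a break; nothing is printed yet.
        if self.par != self.BLOCKED:
            self.par = self.WANTED

    def display(self, lines):
        # .Bd implies a break before itself: drop any pending request.
        self.par = self.NONE
        self.out.append(".Bd -literal")
        self.out.extend(lines)
        self.out.append(".Ed")

    def text(self, words):
        # Ordinary output: emit a pending .Pp first, then clear the
        # state (assumption: this also lifts a block).
        if self.par == self.WANTED:
            self.out.append(".Pp")
        self.out.append(words)
        self.par = self.NONE


f = Formatter()
f.section("DESCRIPTION")
f.para()
f.text("First paragraph.")
f.para()
f.text("Second paragraph.")
f.para()
f.display(["$ ls"])
print("\n".join(f.out))
```

Only one .Pp is printed, between the two paragraphs: the request after .Sh is blocked, and the one before .Bd is dropped.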

Input whitespace handling

The parser now detects when an element or string of text is preceded by whitespace, setting a flag in the node. The formatter can use that flag for formatting decisions, in particular concerning output whitespace.
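As an illustration of what such a flag looks like, here is a small Python sketch. It is not the docbook2mdoc parser; the Node class, the field name spaced and the tokenizer are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    kind: str     # "open", "close" or "text"
    value: str    # element name or text content
    spaced: bool  # True if whitespace preceded this node in the input


# Whitespace run, or tag, or text without surrounding whitespace.
TOKEN_RE = re.compile(
    r"(\s+)|<(/?)([\w.-]+)[^>]*>|([^<\s](?:[^<]*[^<\s])?)")


def tokenize(src):
    """Split src into nodes, remembering for each node whether
    whitespace came directly before it."""
    nodes = []
    spaced = False
    for m in TOKEN_RE.finditer(src):
        ws, closing, name, text = m.groups()
        if ws:
            spaced = True
            continue
        if name:
            kind = "close" if closing else "open"
            nodes.append(Node(kind, name, spaced))
        else:
            nodes.append(Node("text", text, spaced))
        spaced = False
    return nodes


src = "<arg>-o<replaceable>file</replaceable></arg> <arg>-v</arg>"
for node in tokenize(src):
    print(node)
```

In this input, only the second <arg> has the flag set, so a formatter could tell that "-o" and "file" belong together without a space while "-v" is a separate word.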

Output whitespace handling

Another state variable now keeps track of whether we are on a new output line, on a text line, or on a macro line. Text and macro argument output functions provide options to request printing with or without whitespace before their main argument. Actually, there are two kinds of output whitespace: On the one hand, whitespace in the mdoc output, e.g. spaces between words on text lines or spaces between macro arguments. On the other hand, whitespace in the final formatted output. Some of the combinations are easy: For example, printing text to a text line in no-spacing mode will simply not print a blank before the text. Some of the combinations are not trivial: For example, printing to a macro line in no-spacing mode will usually print a space, an Ns macro, and another space before the argument to suppress the space in the final output. Surprisingly, whitespace handling is usually among the most complicated aspects of text processing. (Huh? What did you say? No, we are not doing paragraph filling or adjustment, hyphenation, kerning, ligatures, nor italic corrections.)
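The two cases mentioned above, text lines and macro lines in no-spacing mode, can be sketched like this. The Python class below is a toy model with invented names, not the output code of docbook2mdoc, and it treats every token on a macro line (including nested macro names such as Ar) as just a word.

```python
class Output:
    """Toy model of the output line state."""

    NEW, TEXT, MACRO = range(3)  # kind of line we are currently on

    def __init__(self):
        self.buf = []
        self.line = self.NEW

    def newline(self):
        if self.line != self.NEW:
            self.buf.append("\n")
            self.line = self.NEW

    def macro(self, name):
        # A macro always starts on a fresh line.
        self.newline()
        self.buf.append("." + name)
        self.line = self.MACRO

    def arg(self, word, spacing=True):
        # On a macro line, arguments must be separated by a blank in
        # the mdoc source; Ns suppresses the blank in the final output.
        if self.line != self.MACRO:
            raise ValueError("arg() is only valid on a macro line")
        self.buf.append(" " if spacing else " Ns ")
        self.buf.append(word)

    def text(self, word, spacing=True):
        if self.line == self.MACRO:
            self.newline()
        if self.line == self.TEXT and spacing:
            self.buf.append(" ")
        self.buf.append(word)
        self.line = self.TEXT

    def result(self):
        return "".join(self.buf)


o = Output()
o.text("Use")
o.macro("Fl")
o.arg("o")
o.arg("Ar", spacing=False)
o.arg("file")
o.text("to")
o.text("write")
o.text(".", spacing=False)
o.newline()
print(o.result(), end="")
```

The middle line of the result is ".Fl o Ns Ar file": the source needs blanks between the arguments, and the Ns macro takes the blank out again in the formatted output.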

Use cases

  • Kristaps' initial motivation for writing docbook2mdoc was formatting OpenGL documentation.
  • I have heard that people use it for GTK and for systemd documentation.
  • An idea i already mentioned at BSDCan 2014 but which was never actually realized was chaining doclifter(1) and docbook2mdoc(1) to convert from man(7) to mdoc(7). I'm likely to do that for real once we have something that we want to get lifted that way.
  • The /usr/xenocara/ tree contains about 250 DocBook files. Some of them might contain information worth extracting. I plan to look into that together with matthieu@.

Let me know about your use cases and how well it already works for them (or not). You should no longer expect complete failures, but certainly still rough edges, and some of the formatting can almost certainly still be improved.



Comments
  1. By Predrag Punosevac (Oko) punosevac72@hotmail.com on

    Dear Ingo,

    Thank you for all the hard work and care you have put over the years into OpenBSD documentation and the tools for creating it, making OpenBSD the best documented OS in existence, bar none. This is a very informative and useful write-up, and I truly enjoyed reading it and learning something new.
