OpenBSD Journal

[m2k10] mandoc mini-hackathon

Contributed by jcr from the man-jokes-are-cliche dept.

Development on the mandoc(1) manual formatter is moving fast these days. Recently, mandoc was hacked on during two hackathons in less than two months. From May 13 to May 17, 2010, Kristaps Dzonsons (bsd.lv and OpenBSD), Joerg Sonnenberger (NetBSD) and Ingo Schwarze (OpenBSD) met at the BEC.de site in Elmenhorst near Rostock, Germany for a mini-hackathon (m2k10) dedicated exclusively to mandoc. Ingo was again focusing on mandoc during the yearly OpenBSD general hackathon (c2k10), and Kristaps was strongly supporting him remotely.

The mandoc utility is a lightweight, portable mdoc(7) and man(7) formatter written in C, started by Kristaps in 2008, so far supporting ASCII, HTML and simple PostScript output. Kristaps has committed to developing PostScript output during the current GSOC.

OpenBSD -current has recently switched over to pre-formatting the base system manuals with mandoc instead of groff during the system build. The current plan is to release OpenBSD 4.8 built with mandoc this autumn and to remove groff from the base system by the OpenBSD 4.9 release next spring. The NetBSD, FreeBSD and DragonFly trees also include mandoc, and NetBSD is planning to eventually switch over the tree to mandoc just like OpenBSD did.

If you're unfamiliar with mandoc, you may enjoy reading the previous undeadly article on the topic; otherwise, read on for more technical details of the ongoing work.

The mandoc hackathon was splendidly hosted by Joerg Sonnenberger in the nice atrium hall of the BEC company. The perfectly calm and comfortable site, as well as the fine cooking by Joerg, allowed lots of focused and productive hacking in a friendly atmosphere, interrupted by nothing but a refreshing stroll along the cliff of the nearby Baltic Sea on Sunday afternoon.

During the m2k10 hackathon, the following new features were implemented:

  • Kristaps wrote a new roff preprocessor, handling just those low-level roff instructions needed most in real-world manuals. So far, the preprocessor fully parses and intelligently ignores the .ig, .de, .dei, .de1, .am, .ami and .am1 roff instructions, and it fully parses the .if, .ie and .el instructions, handling a small initial subset of constant conditions. Nested "\{ \}" blocks, nested roff instructions (".ie n .de ..."), user-defined end macros (".ig myend"), and interpretation of instructions even in conditional-negative context (where required) are already supported.

  • Kristaps wrote a new framework to detect and handle the end of a sentence (EOS). It is simpler and more powerful than what Ingo had hacked up before, even though it reuses some of the ideas. In addition to mdoc(7), man(7) now has proper end-of-sentence handling, too. The main practical advantage is that EOS detection now works even when a sentence is enclosed in parentheses or quotes. Some isolated quoted characters in the middle of sentences (like '!' and "...") are still misinterpreted as the end of a sentence, but most such cases will not be too difficult to fix. There are no more EOS pseudo-macros that must be handled by all formatting frontends; instead, EOS conditions are now communicated by node flags, which frontends can optionally handle if they want to. In particular, the main terminal formatter function (term_flushln) became simpler because it doesn't need to bother with EOS tokens any more.

  • Tab-separated ".Bl -column" lists are now handled correctly even when the column-separating tab characters are part of quoted strings, even when those quoted strings start in the middle of a column. This allows great flexibility and convenience when designing simple tables and fixes pages such as sysctl(3) as well as many device driver manuals. (Kristaps)

  • Joerg has cleaned up considerable parts of the main program and its static functions (main.c) for simplicity and efficiency in general and for the consistency of error handling. For example, mmap(2) is now used for faster disk access. The mandoc utility now exits 0 when all pages specified on the command line were successfully formatted, and it exits with a value > 0 when an error occurred. This even works when processing multiple input files in -fign-errors mode: When one input file causes problems, mandoc will continue with the next one but still exit with a value > 0 at the end. We agreed to distinguish three classes of messages in the future:
    • FATAL: meaning errors aborting parsing of the affected file, generating no output for that file, but optionally continuing with other files.
    • ERRORS: meaning that something is so seriously wrong with a file that probably information will be lost or the document structure will be badly mangled; output is still generated, but without any guarantee regarding quality, and an error code will be returned at the end.
    • WARNINGS: meaning that the input contains syntax problems or deprecated constructs that should be fixed, but do not prevent reliable formatting; this will not prevent mandoc from reporting success at the end.
    The implementation of the distinction of FATAL errors, ERRORS and WARNINGS has been started by Kristaps. When finished, this will help the ports(7) and pkgsrc systems in particular because ports can decide to use groff for formatting manuals that produce output with mandoc, but return error codes.

  • Ingo has implemented parsing, Abstract Syntax Tree (AST) representation and rendering of badly nested blocks, e.g. ".Aq aq Bo bo\nnl\n.Bc bc" rendering as "<aq [bo>nl] bc". Of course, this is incredibly ugly and cannot easily be mapped into HTML output, but such nesting violations occur in a sufficient number of real-world manuals that treating this as a FATAL parsing error was rather annoying.

  • Ingo has rewritten the main mdoc text parser, mdoc_ptext(), making it easier to understand and fixing various bugs. It is now correctly stripping white space from the end of text lines, in literal mode stripping tab characters as well, and it issues consistent warnings regarding trailing spaces and tabs on text lines. Besides, escaped backslashes no longer escape the following character.

  • Ingo has re-synced the main terminal output function term_flushln() between OpenBSD and the upstream version at bsd.lv, finally reliably eliminating the remaining cases of trailing white space in terminal output, and preparing for the upstream inclusion of a few features he had already committed to OpenBSD before the hackathon, such as proper handling of literal tab characters both in ragged and literal mode, correct line break handling both in nested lists and lists containing empty item bodies, and optional breaking of lines overflowing the right margin at existing hyphens.

  • Joerg has started a major redesign concerning the handling of horizontal spacing in terminal output, which still requires a lot of work but is expected to ultimately become simpler and more robust than what we have now.

  • Joerg started a regression testing framework on bsd.lv, integrating Ingo's existing tests from OpenBSD.
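
The exit status rules and the three message classes described in the list above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names msg_level, level_raise(), file_has_output() and exit_status() are hypothetical and are not taken from the mandoc sources, and the concrete exit value 1 merely stands in for "a value > 0".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most severe. */
enum msg_level {
	LEVEL_NONE = 0,	/* no messages at all */
	LEVEL_WARNING,	/* fixable problem; formatting stays reliable */
	LEVEL_ERROR,	/* output is generated, but may be mangled */
	LEVEL_FATAL	/* parsing aborted; no output for this file */
};

/* Remember the most severe message seen so far for one file. */
static enum msg_level
level_raise(enum msg_level worst, enum msg_level seen)
{
	return seen > worst ? seen : worst;
}

/* Only a FATAL error suppresses the output for a file. */
static int
file_has_output(enum msg_level worst)
{
	return worst != LEVEL_FATAL;
}

/*
 * Compute the exit status for a whole run: 0 if no file was worse
 * than WARNING, 1 (an assumed value > 0) otherwise.  Every file is
 * inspected, so processing continues past a bad file while the
 * failure is still reported at the end.
 */
static int
exit_status(const enum msg_level *worst, size_t nfiles)
{
	size_t i;
	int rc = 0;

	assert(worst != NULL || nfiles == 0);
	for (i = 0; i < nfiles; i++)
		if (worst[i] >= LEVEL_ERROR)
			rc = 1;
	return rc;
}
```

With one entry per input file, a run consisting only of clean files and files with WARNINGS yields 0, while a single file at ERROR or FATAL level makes the whole run return 1 even though the remaining files are still formatted.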

The following minor features were also implemented:

  • Formatting will not be aborted any more when invalid characters are detected in the input file, not even when running with -Tlint or -fstrict. There still is a warning, though. (Ingo)

  • Besides OPEN and CLOSE, a third class of delimiters has been defined, namely MIDDLE, containing the bar '|', such that the bar no longer falls out of macros and can be quoted without escaping. While here, the mdoc_isdelim() function was changed to return an enum instead of an int for additional type safety. (Ingo)

  • Manual sections are no longer required to be purely numeric, allowing section numbers like "1X", "3f" and "3p". (Kristaps)

  • The man(7) parser and formatter now handle the .AT (AT&T version) and .UC (BSD version) macros. (Joerg)

  • The .Ex (exit value) macro is no longer restricted to section 1, 6 and 8 manual pages. The .Fd (function declaration) is no longer restricted to the SYNOPSIS section and .Lb (library) is no longer restricted to the LIBRARY page section. (Joerg)

  • Handling of the default output width was cleaned up by Joerg, removing magic numbers and centralizing the required constants. The default right margin is now after column 78 for both mdoc(7) and man(7).
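
To illustrate the delimiter classes mentioned in the list above, here is a minimal sketch of a classifier that returns an enum instead of an int. The function name classify_delim(), the enum constant names and the exact sets of opening and closing characters are assumptions made for this example; the article itself only states that the classes OPEN, CLOSE and MIDDLE exist and that MIDDLE contains the bar '|'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; only OPEN, CLOSE and MIDDLE are from the article. */
enum delim {
	DELIM_NONE = 0,	/* not a delimiter */
	DELIM_OPEN,	/* opening delimiter */
	DELIM_MIDDLE,	/* the bar '|' */
	DELIM_CLOSE	/* closing delimiter or punctuation */
};

/*
 * Classify a word.  Only single-character words are considered;
 * the character sets below are assumed for illustration.
 */
static enum delim
classify_delim(const char *word)
{
	assert(word != NULL);

	if (word[0] == '\0' || word[1] != '\0')
		return DELIM_NONE;

	switch (word[0]) {
	case '(':
	case '[':
		return DELIM_OPEN;
	case '|':
		return DELIM_MIDDLE;
	case ')':
	case ']':
	case '.':
	case ',':
	case ';':
	case ':':
	case '?':
	case '!':
		return DELIM_CLOSE;
	default:
		return DELIM_NONE;
	}
}
```

Returning an enum lets the compiler warn about a switch statement that forgets one of the classes, which is the kind of additional type safety the change was aiming for.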

A couple of bugs were fixed, too:

  • White space between the end of the line and a trailing comment is now correctly stripped together with the comment, eliminating spurious white space from the output. (Joerg)

  • In ".Bl -column" lists, it is acceptable to not specify the width of the last column in the .Bl header, and in some cases this does indeed help to easily get optimal formatting. Thus, the related warning has been removed. (Joerg)

  • Formatting is no longer aborted when encountering an invalid argument to a .St (standard) or .At (AT&T version) macro. Instead, the argument is printed and a warning is issued. (Joerg)

  • The man(7) .IP (indented paragraph) macro does not force a double space after the head string any more, bringing its behaviour in line with groff. (Joerg)
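
The trailing-comment fix from the list above can be pictured with the following sketch. It assumes that a comment is introduced by the roff escape \" and that an escaped backslash (\\) does not start one; strip_comment() is a hypothetical helper written for this example, not the function used in mandoc.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Strip a trailing roff comment (introduced by \") from a line,
 * together with any blanks or tabs directly in front of it.
 * The line is modified in place and its new length is returned.
 * Lines without a comment are left untouched.
 */
static size_t
strip_comment(char *line)
{
	size_t i, end;

	assert(line != NULL);

	for (i = 0; line[i] != '\0'; i++) {
		if (line[i] != '\\')
			continue;
		if (line[i + 1] == '"')
			break;		/* the comment starts here */
		if (line[i + 1] != '\0')
			i++;		/* skip the escaped character */
	}
	if (line[i] == '\0')
		return i;		/* no comment on this line */

	/* Drop the white space between the text and the comment. */
	end = i;
	while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'))
		end--;

	line[end] = '\0';
	return end;
}
```

For an input line consisting of some text, a few blanks and a \" comment, only the text remains, so no spurious white space reaches the output.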

Besides, several improvements committed to bsd.lv shortly before the hackathon have been integrated into the OpenBSD tree:

  • Multiple consecutive space characters in the input file are now preserved even outside literal mode. This agrees with what groff does and may be useful for better control over -Tascii formatting, though it is hardly portable to other output formats and not recommended for general use. (Kristaps)

  • The right text margin does not apply to literal context any more. This helps several pages to render prettier, in particular in the Perl manuals, but also some in base, e.g. syslog.conf(5). (Kristaps)

  • The .Cd (configuration declaration) and .Rv (return value) mdoc(7) macros are no longer restricted to certain page sections. (Kristaps)

  • A bug was fixed that caused some strings to be treated as macro invocations even when they were quoted. (Kristaps)

  • In -Txhtml mode, auto-closing of the LINK tag was fixed. (Daniel Friesel)

  • The \*(Ba predefined string "|" is now correctly treated as a delimiter. In particular, it does not get a leading dash any more when occurring inside the .Fl (flag) macro. (Kristaps)

  • The expected position of the EXIT STATUS section within the page was corrected to conform to FreeBSD conventions. (Ulrich Spoerlein)

  • The input column number is no longer used to identify the first macro on a line. This broke when there was white space between the initial control character ('.') and the macro itself. (Kristaps)

  • The mdoc_arg* family of functions was modified to use enum instead of int return types, improving type safety and easing debugging. (Kristaps)
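
As an illustration of the macro detection fix in the list above, the following sketch finds the macro name on a line by skipping white space after the control character rather than relying on a fixed input column. The helper macro_name() is hypothetical and only handles the '.' control character named in the article.

```c
#include <assert.h>
#include <stddef.h>

/*
 * If the line is a macro line, copy the macro name into buf and
 * return its length; otherwise return 0 and leave buf empty.
 * Blanks and tabs between the control character and the name are
 * skipped, so the name may start in any input column.
 * Names longer than bufsz - 1 characters are truncated.
 */
static size_t
macro_name(const char *line, char *buf, size_t bufsz)
{
	size_t i, n = 0;

	assert(line != NULL && buf != NULL && bufsz > 0);
	buf[0] = '\0';

	if (line[0] != '.')
		return 0;	/* text line, not a macro line */

	i = 1;
	while (line[i] == ' ' || line[i] == '\t')
		i++;

	while (line[i] != '\0' && line[i] != ' ' && line[i] != '\t' &&
	    n + 1 < bufsz)
		buf[n++] = line[i++];
	buf[n] = '\0';
	return n;
}
```

Both ".Sh NAME" and ".   Sh NAME" yield the name "Sh" here, whereas a check tied to the second input column would only recognize the first form.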

The last of the changes listed above have just been committed to both the bsd.lv portable tree and the OpenBSD in-tree production mandoc. New stuff is already starting to come in from the c2k10 hackathon and the GSOC, so stay tuned for more news in the future...

Bug reports should be sent to (kristaps _AT_ openbsd.org) and (schwarze _AT_ openbsd.org). A list of known issues is available on the todo list and more information can be found at mdocml.bsd.lv.



Comments
  1. By Chris Bennett (chrisbennett) webmaster@bennettconstruction.us on www.bennettconstruction.us

    > The right text margin does not apply to literal context any more. This helps several pages to render prettier, in particular in the Perl manuals,

    I can verify this, much nicer (proper!) presentation in one of my own perl scripts! Hurray!

  2. By J.C. Roberts (jcr) jcr@designtools.org on http://www.designtools.org

    I saw a somewhat misinformed question elsewhere, namely, "Why are OpenBSD and NetBSD using mandoc while FreeBSD is using mdocml?"

    The answer is simply that the two are one and the same. Kristaps had some historical reason for the older "mdocml" name, but the new name of "mandoc" seems to have replaced it in most common usage.

    Comments
    1. By Kristaps Dzonsons (kristaps) on

      > I saw a somewhat misinformed question elsewhere, namely, "Why are OpenBSD and NetBSD using mandoc while FreeBSD is using mdocml?"
      >
      > The answer is simply the two are one in the same. Kristaps had some historical reason for the older "mdocml" name, but the new name of "mandoc" has seemed to replace it in most common usage.

      Well, an OpenBSD dev (you know who you are) did suggest "mandocpig" as an alternative... half manual, half document, and half pig.
