OpenBSD Journal

New mandoc -mdoc -T markdown converter

Contributed by pitrh on from the mark me up before you go-go dept.

If you follow commits closely, via source-changes@ or otherwise, you may already know that mandoc has grown another useful feature. Ingo Schwarze sent us this very nicely formatted article about the new mandoc to markdown converter:

I just committed a new mandoc(1) output formatter to OpenBSD-current, for converting manual pages written in the mdoc(7) markup language to markdown. The point is that in some contexts, documentation authors are required by third-party policies to provide markdown versions of their documentation. This new output mode allows them to maintain only one copy of their documentation in the well-known, simple, and high-quality mdoc(7) language while still providing markdown versions for the purposes where those are required, which may for example include pasting them into Wikis. Thanks to Reyk Flöter (OpenBSD) and to Vsevolod Stakhov (FreeBSD) for suggesting such an output mode, and to Kristaps Dzonsons (bsd.lv) for contributing several ideas to this writeup.
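As a rough sketch of how such a single-source workflow might be scripted, the snippet below builds the mandoc(1) command line for the markdown formatter and writes the result next to the source file. The helper names and the file name foo.1 are hypothetical; only the -mdoc and -T markdown options come from the article itself, and mandoc(1) is assumed to be installed.

```python
import subprocess
from pathlib import Path


def markdown_command(manpage: str) -> list[str]:
    """Build the mandoc(1) invocation that selects the markdown formatter."""
    return ["mandoc", "-mdoc", "-T", "markdown", manpage]


def convert(manpage: str) -> Path:
    """Convert one mdoc(7) page to markdown; needs mandoc(1) in PATH."""
    target = Path(manpage).with_suffix(".md")
    result = subprocess.run(markdown_command(manpage), check=True,
                            capture_output=True, text=True)
    target.write_text(result.stdout)
    return target
```

Calling convert("foo.1") would then leave the generated markdown in foo.md, while foo.1 remains the only file that is edited by hand.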

The reason for providing this output mode is not that i consider markdown a good, or even a half-decent, markup language. Quite to the contrary, I hereby officially declare it the shittiest markup language i have seen so far. Basically, it hasn't any strong point whatsoever, but the downsides are numerous, scary, and cover practically every relevant aspect:

Lack of expressiveness:

Markdown is pitifully weak and powerless even by its own standard, which is: make formatting easy for anything that can be expressed in a plain-text email.

For example, it doesn't provide any syntax for definition lists (<dl> in HTML, .Bl -tag in mdoc(7), .TP in man(7)) even though such lists can easily be written in a plain-text email.

Context sensitivity:

The syntax and semantics are extremely context sensitive. Almost every token can take completely different meanings depending on where it appears.

Ambiguity:

The syntax for emphasis by enclosing in asterisks or underscores is terribly ill-designed because it gives rise to no end of ambiguity — and not just the classic example of long_var_name, but also confusion about start and end tags. For example, **bold***italic* works as expected, but if you add another **bold**, as in **bold***italic***bold**, it may become <strong>bold<strong><em>italic</em></strong>bold</strong>, at least with some markdown compilers.
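The long_var_name problem can be reproduced with a deliberately naive converter. This is only an illustration of the ambiguity, not the algorithm of any real markdown compiler; the function name is made up for this sketch.

```python
import re


def naive_emphasis(text: str) -> str:
    """Deliberately naive: treat every _..._ pair as emphasis."""
    return re.sub(r"_(.+?)_", r"<em>\1</em>", text)


print(naive_emphasis("_really_ important"))
# prints: <em>really</em> important
print(naive_emphasis("set long_var_name first"))
# prints: set long<em>var</em>name first
```

The second call mangles an identifier that was never meant as markup. Real parsers add special cases to avoid this, and those special cases are where they start to disagree with each other.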

Mixup of semantic and presentational markup:

You can't switch off filling (which is a presentational manipulation) without getting <code> tags (which is semantic markup). You can't get indentation (presentational!) without either <code> or <blockquote> (both semantic).

Admittedly, early versions of HTML had similar problems. For example, <i> was originally designed to be presentational; in HTML 5, it is now properly semantic, and the presentational aspects are relegated to CSS, where they belong.

Kristaps summed this up succinctly: "HTML 5 is (kinda) semantic; markdown is not."

In theory, HTML code generated from markdown input could be improved if parser maintainers chose to generate HTML output that is less encumbered with unintended semantic connotations. But Kristaps tells me parser maintainers rarely do that, for two reasons. Through inertia, most CSS files for markdown-generated HTML now expect these cruddy HTML constructs. And so do some tools that check the output of markdown-to-HTML converters for "correctness", checking that the emitted tags agree with tradition rather than checking whether they make sense semantically.

Lack of independence:

Markdown is not at all a self-contained language. It allows embedding arbitrary HTML code, both at the block and at the flow level. That makes writing any parser for it very hard because you basically have to include a full HTML parser and then add context sensitive complications on top of it. You also have to worry about all the security caveats of HTML. For example, HTML allows embedding Javascript, so you get to implement a Javascript interpreter as well, and to secure it.

Fortunately, i did not have to implement a markdown parser: mandoc(1) only needs to write markdown, not read it. Reading markdown code is the job of lowdown(1).

So far, so bad: you get all the downsides of HTML for sure. But you get almost none of the benefits of HTML because markdown imposes lots of arbitrary and crippling restrictions on how you can use HTML. For example, inside unfilled text, you can neither use named or numbered character references, nor flow-level elements like <em>, nor even native markdown formatting instructions like **. You can't use any block-level HTML elements inside any text that is to be indented. You can't use any kind of markdown formatting inside block-level HTML elements. As an example, even if you are willing to write definition lists in HTML syntax, their list items cannot contain nested markdown lists or displays, nor can the items of markdown lists contain definition lists. While markdown list elements can contain paragraph breaks, that no longer works when the list as a whole is indented. In that case, a paragraph break terminates the list. And so on and so forth, no end of traps here...

Of course, you can work around such nesting restrictions by writing all parent and child elements of the HTML block you want to nest in HTML rather than in markdown syntax, even if markdown syntax exists for these parent and child elements when they appear in isolation. But that mostly defeats the purpose of the whole exercise, making you wonder why you ever chose markdown over HTML in the first place.

In addition, markdown was originally intended for autogenerating exactly one target language: HTML. Having only one target language in mind when designing a new meta-language is obviously already a bad idea, but choosing HTML as a target language is even worse, because HTML is notoriously difficult to translate into other formats. So even leaving the many design failures listed above aside, the basic approach of mainly targeting HTML already curtails most of the potential benefit of inventing a simpler markup language.

Syntax inspired by Whitespace:

A line break without a paragraph break requires whitespace at the end of the preceding line, but the number of trailing blanks is semantically significant: there must be at least two. So, the two line endings "foo " and "foo  " have different meaning.
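A minimal sketch of that rule, assuming only what the paragraph above states (two or more trailing blanks mean a line break); the function name is hypothetical:

```python
def is_hard_break(line: str) -> bool:
    """True if the line ends in at least two blanks, i.e. requests a line break."""
    trailing = len(line) - len(line.rstrip(" "))
    return trailing >= 2


print(is_hard_break("foo "))   # prints: False
print(is_hard_break("foo  "))  # prints: True
```

The two inputs look identical in most editors and in most diffs, which is exactly the problem.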

Lack of standardization:

The most official reference manual for markdown is the original one written by John Gruber in 2004. It has been unmaintained since then and leaves various ambiguities, so different parsers tend to parse the same input somewhat differently in detail. In a language starved for features, that's particularly unfortunate: you usually can't switch to an alternative syntax to avoid the ambiguities, because usually there aren't any alternatives at all.

Lack of extensibility:

The language clearly wasn't designed with extensibility in mind, and it shows. That alone would not necessarily be an important downside: if a language is well-designed in the first place, even if it is not extensible, it easily beats an ill-designed extensible language. But unfortunately, markdown is both ill-designed and lacks many important features, so this language would really need extensibility to become usable at all.

Consequently, many different people went ahead and implemented ("designed" would probably be the wrong word here: i don't think software design is part of the picture when it comes to markdown) their own ad-hoc extensions. Some of them are no doubt useful, but all the various versions of the language that exist in the wild are now incompatible with each other. Some people say this is the main weakness of markdown as a language, but i don't agree. Sure, it is one annoying weakness, but there are many others that are even worse.

For example, i ranted above about the lack of definition lists. PHP Markdown Extra, Python Markdown, and pandoc appear to support a syntax for them, and so may Github, although it doesn't appear to be documented for Github.

To avoid the mess of extensions that may or may not be supported, mandoc(1) only generates code according to John Gruber's original specification and does not rely on any extensions. Of course, that does not avoid the danger that some plain text in the markdown code generated from your document may accidentally trigger some extension handling in whatever markdown compiler you are using.
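To illustrate what protecting plain text against accidental markup involves, here is a crude sketch that backslash-escapes every character for which Gruber's syntax description lists a backslash escape. This is an assumption-laden illustration, not mandoc(1)'s actual algorithm: it over-escapes (every period, every hyphen), where a real formatter would escape only in contexts where the character is significant. And as noted above, no such escaping can guard against extensions the original specification doesn't know about.

```python
# Characters for which Gruber's syntax description lists backslash escapes.
SPECIAL = "\\`*_{}[]()#+-.!"


def escape_markdown(text: str) -> str:
    """Crudely prefix every potentially special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


print(escape_markdown("*not bold*"))
# prints: \*not bold\*
print(escape_markdown("long_var_name"))
# prints: long\_var\_name
```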

In case you wonder — here is how i think that a few other markup languages compare:

  1. LaTeX: Very good. Very powerful in the first place, and very easy to extend. Extension mechanisms are so strong that it is almost usable as a general-purpose programming language. Little context sensitivity and ambiguity. The syntax is slightly cumbersome, but still more palatable than HTML. The excessive size of the TeX Live distribution is a serious nuisance. The official death of the Texinfo project implies that LaTeX has become irrelevant for software documentation.
  2. roff(7): Very good. Very simple and friendly syntax, works well even with diff(1). Reasonably powerful in the first place, and the extension mechanisms are very powerful. Some context sensitivity, but not too bad. Unfortunately, while extensibility is powerful, it requires unusual, fragile, and sometimes downright ugly syntax — but that is of little importance because it rarely affects end-users.
  3. HTML: Acceptable. The basics are very easy to learn. But HTML without CSS is of limited use, and CSS is terribly overengineered, while at the same time lacking important features — the landmark symptom of a botched design. Even though designed for extensibility, that is almost unusable in practice because XSLT and its sub-languages are among the most hostile languages on the planet.
  4. DocBook: Abominable. Overengineered beyond absurdity, ridiculously slow toolchain, syntax encumbers the source code to the point of making it unreadable. The man(7) output of the standard toolchain is by far the lowest quality autogenerated man code of any tool that i'm aware of. Absolutely never use DocBook for anything. As a language, in theory, it is probably better designed than markdown, but that is irrelevant because it is even more unusable than markdown in practice.
  5. markdown: Abominable. See above.
  6. There are a few others (for example AsciiDoc, reStructuredText, ...) but i dare not judge them because i have too little experience with them.
  7. OpenDocument: Oh did i really mention that? How stupid of me. It's not April 1st yet.

So, the bottom line is: Do not use markdown. Do not use DocBook. Do not use Texinfo. Use mdoc(7) to maintain your source documents, and mandoc(1) to convert them when needed (including to simple PostScript or PDF output), or use groff(1) if you need to convert them to high-quality PostScript or PDF output.


Comments
  1. By A Random Scholar (70.71.121.227) on

    I agree with your assessment of the various markup language options.

    I have several theses in progress where I need to include: images (png, jpeg) with captions, and short quotations in Latin, Greek, Hebrew, and Ethiopian (UTF-8) alphabets. Also, the structure of the documents is book, chapter, subsection, and footnotes.

    Is mdoc/mandoc suitable for this use case? Finding a format other than HTML that lets me easily mix in short snippets of foreign languages while also giving nice printed output has been a long journey.

    Comments
    1. By Ingo Schwarze (schwarze) on mdocml.bsd.lv

      > Is mdoc/mandoc suitable for this use case?

      No, it is absolutely inappropriate. You cannot use mdoc(7) for that at all. It does not support any kind of images or captions, it does not support RTL scripts at all, it does not support chapters at all, it does not support footnotes at all, and it is not very well-suited to documents containing more than trivial amounts of non-latin scripts.

      The mandoc(1) program is highly specialized for handling computer software documentation, with a strong bias towards English text.

      > Finding a format other than HTML

      HTML is an extremely poor choice for typesetting a book. Don't do that, ever. It will look very ugly and unprofessional no matter how carefully you do it, and maintaining it will be a nightmare.

      I'm fairly sure that what you want can be done both with LaTeX and with groff - except that i don't know how good RTL support is. If you forced me to provide a definite recommendation on the spot, i would probably recommend LaTeX over groff, for two reasons: It is more actively maintained, so the risk of having issues with RTL is probably lower. But i don't know for sure. And configuring fonts in groff is probably more finicky than in LaTeX - and you will have to manually configure fonts for several languages. With groff, not even a Cyrillic font is installed by default, even though that is much more widely used than Ethiopian.

      But take that with a grain of salt. I'm a physicist and i have done some typesetting of mathematical formulae and scientific diagrams. I have never done typesetting for linguistics, and i may miss important aspects needed in your field. You should definitely go and get advice from a professional researcher in linguistics.

      Comments
      1. By Maybe Not Kristaps (46.11.108.131) on

        What everybody's thinking: when can we expect ingoml for writing documentation when mdoc(7) doesn't suit?

        Comments
        1. By Ingo Schwarze (schwarze) on mdocml.bsd.lv

          > when can we expect

          Don't you know that question is illegal in OpenBSD land?
          Stuff is ready when it is ready.
          No cheap talk. No roadmap committees. Shut up and hack.

          > ingoml for writing documentation when mdoc(7) doesn't suit?

          No plans. I don't see any need for that.
          For *documentation*, it is an asset, not a limitation, to conform to mdoc(7) conventions.
          It provides a good structure and helps users because they are already familiar with the style.

          If you write documentation and mdoc(7) seems like a poor fit, you are doing something wrong.

          That said, for non-documentation typesetting, mdoc(7) cannot become the right tool, or it would lose its strength for documentation. But i'm not planning to write another typesetting language. That is not an area where there is need for something new. Two very good typesetting languages already exist: (La)TeX and roff. Besides, i lack the knowledge about typesetting required to do even better. Typesetting is a craft and an art.

    2. By Anonymous Coward (2001:1a50:50dd:100:349a:52fe:b99f:33e6) on

      > I have several theses in progress where I need to include: images (png, jpeg) with captions, and short quotations in Latin, Greek, Hebrew, and Ethiopian (UTF-8) alphabets. Also, the structure of the documents is book, chapter, subsection, and footnotes.

      fwiw, i have seen exactly that combination with s/Ethiopian/Hieroglyphs/ in a very nice 400 page LaTeX document.

  2. By thuban (109.190.193.124) on http://yeuxdelibad.net

    While reading all markdown caveats, it made me think of txt2tags [1], more powerful than markdown by its syntax and not html-only.

    [1] : http://txt2tags.org/

    Comments
    1. By kraileth (212.77.224.251) kraileth@elderlinux.org on http://www.elderlinux.org

      > While reading all markdown caveats, it made me think of txt2tags [1], more powerful than markdown by its syntax and not html-only.
      >
      > [1] : http://txt2tags.org/

      Thanks for sharing this; I've been fighting with markdown, multimarkdown and such in the past. While I liked textile best it still wasn't a perfect solution for me. Txt2tags somehow eluded me so far but I'll give it a try - until we can all adopt the new shiny IngoML! (Never say die! ;))

      Comments
      1. By Ed Ahlsen-Girard (girard) on

        > > While reading all markdown caveats, it made me think of txt2tags [1], more powerful than markdown by its syntax and not html-only.
        > >
        > > [1] : http://txt2tags.org/
        >
        > Thanks for sharing this; I've been fighting with markdown, multimarkdown and such in the past. While I liked textile best it still wasn't a perfect solution for me. Txt2tags somehow eluded me so far but I'll give it a try - until we can all adopt the new shiny IngoML! (Never say die! ;))

        Plain Old Documentation from perl (aka pod) was adequate to write Programming Perl (aka the Camel Book) with, but it was very light on illustrations.

  3. By patrik (80.252.185.200) on

    Did you consider targeting http://commonmark.org/ instead of http://daringfireball.net/projects/markdown/?
    My understanding is that they try to address much of the ambiguity with Markdown, while being relatively compatible with existing parsers.

    Comments
    1. By Ingo Schwarze (schwarze) on mdocml.bsd.lv

      > Did you consider targeting http://commonmark.org/

      Not before implementing the output mode.

      Because of your comment, i worked through the CommonMark specification, fixed about half a dozen bugs in my implementation where it violated CommonMark, and added a link to the mandoc(1) manual page.

      The basic idea of the CommonMark project is laudable: consensus-based standardization as compatible with existing practice as possible, a reference implementation, and various testing tools.

      Actually reading what they have so far, and what they consider almost finished, is not so much fun. The original idea of markdown was simplicity. The CommonMark specification is very long, very complicated, and in some places complicated to the point of absurdity. As the worst example, just read the long chapter about delimiter runs, which culminates in a list of seventeen (!!) precedence rules. Even though i inspected most of the rest, I'm not willing to study and implement that. It is plainly going over the top. There are various other examples of excessive complexity in the specification.

      Ultimately, it only reinforces my point made in the Undeadly article above. The markdown language is utterly ill-designed from the ground up, and no standardization effort can cure the numerous ailments. It ought to be abandoned outright. DO NOT USE IT. If others force you to produce documentation in markdown, consider writing mdoc(7) instead and converting to markdown with mandoc. Comparing the CommonMark specification to the mdoc(7) manual, writing mdoc(7) is several orders of magnitude easier, and you get semantic rather than very primitive presentational markup.

Credits

Copyright © - Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]