OpenBSD Journal
New mandoc -mdoc -T markdown converter
Contributed by pitrh on Sat Mar 4 16:47:50 2017 (GMT)
from the mark me up before you go-go dept.

If you follow commits closely, via source-changes@ or otherwise, you may already know that mandoc has grown another useful feature. Ingo Schwarze sent us this very nicely formatted article about the new mandoc to markdown converter:

New mandoc -mdoc -T markdown converter

I just committed a new mandoc(1) output formatter to OpenBSD-current, for converting manual pages written in the mdoc(7) markup language to markdown. The point is that in some contexts, documentation authors are required by third-party policies to provide markdown versions of their documentation. This new output mode allows them to maintain only one copy of their documentation in the well-known, simple, and high quality mdoc(7) language while still providing markdown versions for the purposes where those are required, which may for example include pasting them into Wikis. Thanks to Reyk@ Flöter (OpenBSD) and to Vsevolod@ Stakhov (FreeBSD) for suggesting such an output mode, and to Kristaps@ Dzonsons (bsd.lv) for contributing several ideas to this writeup.
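For illustration, the invocation named in the title can be wrapped in a few lines of Python. This is only a sketch: the helper names `markdown_command` and `convert` and the file name `foo.1` are made up for this example, and it assumes that a mandoc(1) new enough to know `-T markdown` is installed and in the PATH.

```python
import subprocess


def markdown_command(manpage):
    """Build the mandoc(1) command line for markdown output."""
    # -mdoc selects the mdoc(7) parser, -T markdown the new output mode.
    return ["mandoc", "-mdoc", "-T", "markdown", manpage]


def convert(manpage):
    """Run mandoc(1) and return the generated markdown as a string.

    Assumption: mandoc(1) with markdown support is installed.
    """
    result = subprocess.run(markdown_command(manpage),
                            check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only show the command here; call convert() on a real manual page.
    print(" ".join(markdown_command("foo.1")))
```

Calling `convert("foo.1")` on an existing mdoc(7) source file should return the markdown text, which can then be written wherever the third-party policy wants it.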

The reason for providing this output mode is not that i consider markdown a good, or even a half-decent, markup language. Quite to the contrary, I hereby officially declare it the shittiest markup language i have seen so far. Basically, it hasn't any strong point whatsoever, but the downsides are numerous, scary, and cover practically every relevant aspect:

Lack of expressiveness:

Markdown is pitifully weak and powerless even by its own standard, which is: make formatting easy for anything that can be expressed in a plain-text email.

For example, it doesn't provide any syntax for definition lists (<dl> in HTML, .Bl -tag in mdoc(7), .TP in man(7)) even though such lists can easily be written in a plain-text email.
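To make the gap concrete, here is a small Python sketch that renders term/description pairs as an HTML <dl> block, which is the kind of fallback one is left with when the markup language has no native syntax for such lists. The function name `definition_list` and the sample entries are invented for this example, and it makes no claim about what mandoc(1) actually emits.

```python
from html import escape


def definition_list(items):
    """Render (term, description) pairs as an HTML definition list."""
    lines = ["<dl>"]
    for term, description in items:
        # Escape both parts so that <, > and & in the text stay literal.
        lines.append("  <dt>" + escape(term) + "</dt>")
        lines.append("  <dd>" + escape(description) + "</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


print(definition_list([("-T markdown", "select markdown output"),
                       ("-mdoc", "parse the input as mdoc(7)")]))
```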

Context sensitivity:

The syntax and semantics are extremely context sensitive. Almost every token can take completely different meanings depending on where it appears.

Ambiguity:

The syntax for emphasis by enclosing in asterisks or underscores is terribly ill-designed because it gives rise to no end of ambiguity — and not just the classic example of long_var_name, but also confusion about start and end tags. For example, **bold***italic* works as expected, but if you add another **bold**, as in **bold***italic***bold**, it may become <strong>bold<strong><em>italic</em></strong>bold</strong>, at least with some markdown compilers.
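The long_var_name case is easy to reproduce. The following Python sketch uses a deliberately naive rule, "anything between two underscores is emphasized", to show how an identifier gets mangled. The function `naive_emphasis` is hypothetical and does not model any particular markdown compiler; real ones apply additional heuristics, and those heuristics differ between implementations.

```python
import re


def naive_emphasis(text):
    """Turn _..._ into <em>...</em> with no regard for word boundaries."""
    # Non-greedy match: the shortest run between two underscores wins.
    return re.sub(r"_(.+?)_", r"<em>\1</em>", text)


# The middle part of the identifier is swallowed as emphasis.
print(naive_emphasis("call long_var_name here"))
```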

Mixup of semantic and presentational markup:

You can't switch off filling (which is a presentational manipulation) without getting <code> tags (which is semantic markup). You can't get indentation (presentational!) without either <code> or <blockquote> (both semantic).

Admittedly, early versions of HTML had similar problems. For example, <i> was originally designed to be presentational; in HTML 5, it is now properly semantic, and the presentational aspects are relegated to CSS, where they belong.

Kristaps summed this up succinctly: "HTML 5 is (kinda) semantic; markdown is not."

In theory, HTML code generated from markdown input could be improved if parser maintainers would choose to generate HTML output that is less encumbered with unintended semantic connotations. But Kristaps tells me parser maintainers rarely do that, for two reasons. Through inertia, most CSS files for markdown-generated HTML now expect these cruddy HTML constructs. And so do some tools that check the output of markdown-to-HTML converters for "correctness", checking that the emitted tags agree with tradition rather than checking whether they make sense semantically.

Lack of independence:

Markdown is not at all a self-contained language. It allows embedding arbitrary HTML code, both at the block and at the flow level. That makes writing any parser for it very hard because you basically have to include a full HTML parser and then add context sensitive complications on top of it. You also have to worry about all the security caveats of HTML. For example, HTML allows embedding JavaScript, so you get to implement a JavaScript interpreter as well, and to secure it.

Fortunately, i did not have to implement a markdown parser: mandoc(1) only needs to write markdown, not read it. Reading markdown code is the job for lowdown(1).

So far, so bad: you get all the downsides of HTML for sure. But you get almost none of the benefits of HTML because markdown imposes lots of arbitrary and crippling restrictions on how you can use HTML. For example, inside unfilled text, you can neither use named or numbered character references, nor flow-level elements like <em>, nor even native markdown formatting instructions like **. You can't use any block-level HTML elements inside any text that is to be indented. You can't use any kind of markdown formatting inside block-level HTML elements. As an example, even if you are willing to write definition lists in HTML syntax, their list items cannot contain nested markdown lists or displays, nor can the items of markdown lists contain definition lists. While markdown list elements can contain paragraph breaks, that no longer works when the list as a whole is indented. In that case, a paragraph break terminates the list. And so on and so forth, no end of traps here...

Of course, you can work around such nesting restrictions by writing all parent and child elements of the HTML block you want to nest in HTML rather than in markdown syntax, even if markdown syntax exists for these parent and child elements when they appear in isolation. But that mostly defeats the purpose of the whole exercise, making you wonder why you ever chose markdown over HTML in the first place.

In addition, markdown was originally intended for autogenerating exactly one target language: HTML. Having only one target language in mind when designing a new meta-language is obviously already a bad idea, but choosing HTML as a target language is even worse, because HTML is notoriously difficult to translate into other formats. So even leaving the many design failures listed above aside, the basic approach of mainly targeting HTML already curtails most of the potential benefit of inventing a simpler markup language.

Syntax inspired by Whitespace:

A line break without a paragraph break requires whitespace at the end of the preceding line, but the number of trailing blanks is semantically significant: there must be at least two. So, the two line endings "foo " and "foo  " have different meanings.
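Since the difference is invisible on screen, a tiny Python sketch may help. It only encodes the rule stated above, at least two trailing blanks mean a line break, and the function names `trailing_blanks` and `is_line_break` are made up for this example.

```python
def trailing_blanks(line):
    """Count the space characters at the end of a line."""
    return len(line) - len(line.rstrip(" "))


def is_line_break(line):
    """Apply the rule: two or more trailing blanks request a line break."""
    return trailing_blanks(line) >= 2


# One trailing blank versus two: same look, different meaning.
for sample in ("foo ", "foo  "):
    print(repr(sample), is_line_break(sample))
```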

Lack of standardization:

The most official reference manual for markdown is the original one written by John Gruber in 2004. It has been unmaintained since then and leaves various ambiguities, such that different parsers tend to parse input somewhat differently in detail. In a language starved for features, that's particularly unfortunate because you usually can't use any alternative syntax to avoid the ambiguities because usually there aren't any alternatives at all.

Lack of extensibility:

The language clearly wasn't designed with extensibility in mind, and it shows. That alone would not necessarily be an important downside: if a language is well-designed in the first place, even if it is not extensible, it easily beats an ill-designed extensible language. But unfortunately, markdown is both ill-designed and lacks many important features, so this language would really need extensibility to become usable at all.

Consequently, many different people went ahead and implemented ("designed" would probably be the wrong word here: i don't think software design is part of the picture when it comes to markdown) their own ad-hoc extensions. Some of them are no doubt useful, but all the various versions of the language that exist in the wild are now incompatible with each other. Some people say this is the main weakness of markdown as a language, but i don't agree. Sure, it is one annoying weakness, but there are many others that are even worse.

For example, i ranted above about the lack of definition lists. PHP Markdown Extra, Python Markdown, and pandoc appear to support a syntax for them, and so may Github, although it doesn't appear to be documented for Github.

To avoid the mess of extensions that may or may not be supported, mandoc(1) only generates code according to John Gruber's original specification and does not rely on any extensions. Of course, that does not avoid the danger that some plain text in the markdown code generated from your document may accidentally trigger some extension handling in whatever markdown compiler you are using.

In case you wonder — here is how i think that a few other markup languages compare:

  1. LaTeX: Very good. Very powerful in the first place, and very easy to extend. Extension mechanisms are so strong that it is almost usable as a general-purpose programming language. Little context sensitivity and ambiguity. The syntax is slightly cumbersome, but still more palatable than HTML. The excessive size of the TeX Live distribution is a serious nuisance. The official death of the Texinfo project implies that LaTeX has become irrelevant for software documentation.
  2. roff(7): Very good. Very simple and friendly syntax, works well even with diff(1). Reasonably powerful in the first place, and the extension mechanisms are very powerful. Some context sensitivity, but not too bad. Unfortunately, while extensibility is powerful, it requires unusual, fragile, and sometimes downright ugly syntax — but that is of little importance because it rarely affects end-users.
  3. HTML: Acceptable. The basics are very easy to learn. But HTML without CSS is of limited use, and CSS is terribly overengineered, while at the same time lacking important features — the landmark symptom of a botched design. Even though designed for extensibility, that is almost unusable in practice because XSLT and its sub-languages are among the most hostile languages on the planet.
  4. DocBook: Abominable. Overengineered beyond absurdity, ridiculously slow toolchain, syntax encumbers the source code to the point of making it unreadable. The man(7) output of the standard tool chain is by far the lowest quality autogenerated man code of any tool that i'm aware of. Absolutely never use DocBook for anything. As a language, in theory, it is probably better designed than markdown, but that is irrelevant because it is even more unusable than markdown in practice.
  5. markdown: Abominable. See above.
  6. There are a few others (for example AsciiDoc, reStructuredText, ...) but i dare not judge them because i have too little experience with them.
  7. OpenDocument: Oh did i really mention that? How stupid of me. It's not April 1st yet.

So, the bottom line is: Do not use markdown. Do not use DocBook. Do not use Texinfo. Use mdoc(7) to maintain your source documents, and mandoc(1) to convert them when needed (including to simple PostScript or PDF output), or use groff(1) if you need to convert them to high-quality PostScript or PDF output.



  Re: New mandoc -mdoc -T markdown converter (mod 1/125)
by A Random Scholar (70.71.121.227) on Sun Mar 5 01:52:42 2017 (GMT)
  I agree with your assessment of the various markup language options.

I have several theses in progress where I need to include: images (png, jpeg) with captions, and short quotations in Latin, Greek, Hebrew, and Ethiopian (UTF-8) alphabets. Also, the structure of the documents is book, chapter, subsection, and footnotes.

Is mdoc/mandoc suitable for this use case? Finding a format other than HTML that lets me easily mix in short snippets of foreign languages while also giving nice printed output has been a long journey.

  Re: New mandoc -mdoc -T markdown converter (mod 3/117)
by thuban (109.190.193.124) on Sun Mar 5 10:07:40 2017 (GMT)
http://yeuxdelibad.net
  While reading all the markdown caveats, i was reminded of txt2tags [1], which has a more powerful syntax than markdown and is not HTML-only.

[1] : http://txt2tags.org/

  Re: New mandoc -mdoc -T markdown converter (mod 9/125)
by patrik (80.252.185.200) on Mon Mar 6 14:53:21 2017 (GMT)
  Did you consider targeting http://commonmark.org/ instead of http://daringfireball.net/projects/markdown/?
My understanding is that they try to address much of the ambiguity with Markdown, while being relatively compatible with existing parsers.


Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. Some icons from slashdot.org used with permission from Kathleen. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. Search engine is ht://Dig. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]