Contributed by Ingo Schwarze on from the my manpages are now webscale dept.
Another major step forward just happened in
mandoc(1)
HTML output: paragraphs are now represented with real HTML
<p>
elements, and a number of cases were fixed
in which mandoc used to generate output violating HTML syntax,
mostly related to macros and requests that control
line filling
in paragraphs of text.
Using <p>
for paragraphs is important because
the main promise of the hypertext markup language — separation
of structure and content on the one hand from presentation and
style on the other
hand — only really holds up when documents use HTML elements
in the canonical way intended by the language design, not when they
abuse HTML features in weird ways to hack together the desired
visual effects.
And it should be even more obvious that producing syntactically
invalid output, even if only in certain infrequent situations,
wasn't good.
So how could it possibly happen that correctly using an element
as fundamental to HTML as <p>
took more than
ten
years of development?
Even though HTML formatting was the original motivation for writing
mandoc in the first place, which
Kristaps originally called
"mdocml"
for that very reason?
On the one hand, the
mdoc(7) and
HTML languages were
built on the same paradigm at the same time and share many
technical concepts.
Cynthia Livingston started development of mdoc at UC Berkeley in
1989 and completed the conversion of
the
first 170 manual pages to
her new language
in June 1990.
Tim Berners-Lee wrote his famous CERN
memo
in 1989 and started HTTP and HTML software development in late 1990.
The main difference obviously is that HTML is a general-purpose
markup language whereas mdoc is strongly domain-specific for manual
pages, resulting in many HTML elements that are never needed by
mdoc documents and also resulting in several different mdoc macros
that all map to the same HTML element, for example
.Cd .Cm .Dl .Dv .Er .Ev .Fd .Fl .Fn .Fo .Ic .In .Nm .Ql
all mapping to <code>.
But both languages initially provided some structural, some physical,
and some semantic markup.
Both provide a concept of both block and in-line elements.
Actually, some macros and elements work in almost the same way:
mdoc macro | HTML element
.Sh | <h1>
.Ss | <h2>
.Bl -tag .It | <dl><dt><dd>
.Bl -enum .It | <ol><li>
.Bl -bullet .It | <ul><li>
.Bd | <div>
.Lk | <a>
.Va | <var>
.Sy | <b>
.Em | <i>
.br | <br>
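The many-to-one relationship between mdoc macros and HTML elements can be pictured as a simple lookup table. The following Python sketch is only an illustration built from the pairs named in the text; the names MACRO_TO_ELEMENT, CODE_MACROS, and macros_for are hypothetical and say nothing about how mandoc itself stores this mapping.

```python
# Illustrative lookup table; the pairs come from the text, not from mandoc's source.
MACRO_TO_ELEMENT = {
    ".Sh": "h1",
    ".Ss": "h2",
    ".Bd": "div",
    ".Lk": "a",
    ".Va": "var",
    ".Sy": "b",
    ".Em": "i",
    ".br": "br",
}

# Several distinct semantic macros all collapse into <code>.
CODE_MACROS = ".Cd .Cm .Dl .Dv .Er .Ev .Fd .Fl .Fn .Fo .Ic .In .Nm .Ql".split()
MACRO_TO_ELEMENT.update({macro: "code" for macro in CODE_MACROS})


def macros_for(element):
    """Return every macro in the table that maps to the given element name."""
    return sorted(m for m, e in MACRO_TO_ELEMENT.items() if e == element)
```

Calling macros_for("code") returns all fourteen macros listed earlier, while macros_for("h1") returns only .Sh, which shows how much semantic detail is lost on the way from mdoc to HTML.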
However, while the way section headers are marked up is similar
— the text of the title is wrapped in a macro or element, but
the body of the section usually isn't — there is a fundamental
difference in the representation of paragraphs: the mdoc language
only marks
paragraph breaks,
and there is no concept of any one paragraph extending from one
place to another, whereas HTML wraps the complete text of each
paragraph into a <p>
element.
Even worse, in mdoc, almost anything can be nested in almost anything,
but in HTML, there are severe syntactical restrictions on nesting.
HTML distinguishes two fundamentally different
kinds
of content: flow content and phrasing content.
Some HTML elements can only occur in flow content but not in phrasing
content, and some HTML elements can only contain phrasing content,
but not flow content.
The HTML paragraph elements
<p>
and
<pre>
are among the most restricted:
they can only occur in flow content, but they can only contain
phrasing content.
Now, <pre>
is the obvious representation for
.Bd -literal
blocks, and it is also logical to somehow represent .Pp
with <p>.
But in mdoc, displays can be nested, and even literal displays
can contain paragraph breaks.
Translating that naïvely results in HTML syntax violations.
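To make the restriction concrete, here is a small Python sketch of the flow versus phrasing rule. It is a deliberately simplified model covering only a handful of elements and only this one rule; the real HTML content model has more categories, and the names FLOW_ONLY, PHRASING_ONLY_CONTENT, violates_nesting, and naive_violations are made up for this illustration.

```python
# Elements that may only appear where flow content is allowed (simplified subset).
FLOW_ONLY = {"p", "pre", "div", "ol", "ul", "dl", "h1", "h2"}

# Elements whose content must be phrasing content only (simplified subset).
PHRASING_ONLY_CONTENT = {"p", "pre", "h1", "h2", "b", "i", "code", "var"}


def violates_nesting(parent, child):
    """True if a flow-only child sits in a parent that accepts only phrasing content."""
    return parent in PHRASING_ONLY_CONTENT and child in FLOW_ONLY


def naive_violations(path):
    """Check each parent/child pair along a path of nested element names."""
    return [(p, c) for p, c in zip(path, path[1:]) if violates_nesting(p, c)]
```

Under this model, a paragraph break inside a literal display translated naively, naive_violations(["div", "pre", "p"]), reports the pair ("pre", "p"), whereas a path such as ["div", "p", "code"] reports nothing.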
Consistently dealing with all the complications explained above required a number of steps.
- The .Pp macro must open a <p> element, without having any idea how long that paragraph might remain open, and without being responsible for closing it again.
- All other mdoc macros had to be taught whether their HTML representation is allowed inside a paragraph, and those where this is not the case must first close the existing paragraph if there is any. For example, this applies to the .Pp macro itself: before it can open its own paragraph, it must close the previous one, if any. But there are many more macros that need similar behaviour, including .Bd .Bf .Bl .D1 .Dl .Nd .Pp .Rs .Sh .Ss.
- The .Bd -literal and .Bd -unfilled macros have to open a <pre> element, and the matching .Ed has to close it again.
- However, if any of the macros that close <p> occur inside such an unfilled display, the <pre> needs to be closed temporarily and re-opened once the disruption has passed.
- It gets even worse: Low-level roff(7) requests to switch to no-fill mode (.nf) and to switch back to fill mode (.fi) also exist, and they interact with paragraphs and displays. For example, an author might manually switch fill mode back on with .fi in the middle of a .Bd -unfilled display, in which case the </pre> at the end of the display must be omitted.
- Such manual fill mode switches remain in force even across macros having representations that cannot occur inside <pre>. For example, if a .Bl -enum list occurs while .nf is active, then the <pre> must be closed before the <ol> can be opened, but the <pre> must be opened again inside each <li> list item, and closed again before the end of each list item, and opened again after the end of the list...
- When .Pp occurs inside <pre>, it must neither be represented with <p> nor close the <pre>. Instead, it simply ought to be printed as a literal blank line.
- The rules for man(7) documents are fundamentally similar, but differ in several details due to the different set of macros available.
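The bookkeeping described in the steps above can be sketched as a tiny state machine. The Python below is a toy model written for this explanation, not mandoc's actual implementation (mandoc is written in C); the class ParagraphTracker and all of its method names are hypothetical, and it ignores the extra complication of re-opening <pre> inside each list item.

```python
class ParagraphTracker:
    """Toy emitter tracking whether <p> or <pre> is currently open."""

    def __init__(self):
        self.out = []        # emitted HTML fragments
        self.in_p = False    # a <p> opened by .Pp is still open
        self.in_pre = False  # a <pre> is currently open
        self.nofill = False  # no-fill mode is active (display or .nf)

    def close_paragraph(self):
        """Close whichever of <p> or <pre> is open, if any."""
        if self.in_p:
            self.out.append("</p>")
            self.in_p = False
        if self.in_pre:
            self.out.append("</pre>")
            self.in_pre = False

    def reopen_pre(self):
        """Re-open <pre> after a disruption if no-fill mode is still active."""
        if self.nofill and not self.in_pre:
            self.out.append("<pre>")
            self.in_pre = True

    def pp(self):
        """.Pp: a blank line inside <pre>, otherwise a new <p>."""
        if self.in_pre:
            self.out.append("\n")
            return
        self.close_paragraph()
        self.out.append("<p>")
        self.in_p = True

    def begin_unfilled(self):
        """.Bd -literal, .Bd -unfilled, or .nf."""
        self.close_paragraph()
        self.nofill = True
        self.reopen_pre()

    def end_unfilled(self):
        """.Ed or .fi: emits </pre> only if a <pre> is still open."""
        self.nofill = False
        self.close_paragraph()

    def block(self, fragment):
        """Emit a complete block-level fragment that may not sit inside <p> or <pre>."""
        self.close_paragraph()
        self.out.append(fragment)
        self.reopen_pre()

    def text(self, s):
        self.out.append(s)

    def html(self):
        return "".join(self.out)
```

For instance, calling pp(), text("intro"), begin_unfilled(), text("code\n"), pp(), block("<h2>x</h2>"), text("more\n"), end_unfilled() in that order yields <p>intro</p>, then a <pre> holding the code and a blank line, then the heading outside any <pre>, then a second <pre> holding the remaining text. Calling end_unfilled() twice in a row, as happens when .fi precedes .Ed, emits only one </pre>.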
All that is now implemented in mandoc -T html,
and I see no more nesting syntax violations in any manual page
below /usr/share/man.
In preparation for the above, large amounts of cleanup were performed, improving separation of different modules of the mandoc program and simplifying some aspects of the architecture.
In addition to this relatively complex improvement, a number of other features were added to mandoc during the last three months since EuroBSDCon 2018:
- Other HTML rendering features:
- Draw table and cell borders in tbl(7) HTML output.
- Span cells as specified by the tbl(7) layout in HTML output.
- Horizontal and vertical alignment of tbl(7) cell content in HTML output.
- \f(CW and \f(CR (constant width font) are now supported in HTML output (so far, all missing features were reported by Pali Rohar).
- .br is now rendered as <br/>, no longer with <div>.
- Several regression tests were added for HTML output.
- Terminal rendering features:
- Use box drawing characters for tbl(7) borders in UTF-8 output (feature suggested by Anthony Bentley (bentley@)).
- Better automatic column width assignments in the presence of horizontal tbl(7) spans (issue reported by Ted Unangst (tedu@)).
- .Bd -centered now fills the text before centering it. This is substantially better than what groff(1) can do, which doesn't really center text in .Bd -centered at all.
- Searching and tagging improvements:
- apropos(1) searches now use case-insensitive extended regular expressions by default, fixing a POSIX violation reported by Wolfram Schneider (wosch@) via Yuri Pankov (yuripv@) from FreeBSD.
- Port the deep linking that is familiar from mandoc-formatted manual
pages on the web to the command line with the new
-O tag
output option. For example, to jump to the same location
as the previous "-O tag" hyperlink, type
man -O tag=tag mandoc
- Strip the macro key when using the above feature in apropos
searches. For example, to jump directly to the documentation
of the ulimit
builtin command, without even having to specify the name of the
manual page (which happens to be
ksh(1)), the following
invocation is sufficient:
man -akO tag Ic=ulimit
- Tag the first word of multi-word macro arguments. For example,
to jump to the explanation of "query from", type:
man -O tag=query ntpd.conf
- Parser improvements:
- Many improvements to the handling, validation, and error reporting
of escape sequences; and new escape sequences \_ \a \E \r.
- Some improvements to manual font selection with the .ft font request and the \f escape sequence.
- \^ in tbl(7) data cells extends the data cell from above (missing feature reported by Pali Rohar).
I deliberately refrain from listing all the bugfixes that were applied during the last three months and restrict the above list to only the new features.
(Comments are closed)
By Will Backman (bitgeist) bitgeist@yahoo.com on http://bsdtalk.blogspot.com
Thank you for such a thorough explanation! I had no idea how complex it was.
By John Gardner (Alhadis) gardnerjohng@gmail.com on https://github.com/Alhadis
Bravo! A *huge* step forward for mandoc(1), especially this:
> `.br is now rendered as <br/>, no longer with <div>.`
I can't tell you how much it pained me to see that. Well done, Ingo. ;-)
— J