g2k18 hackathon report: Ingo Schwarze on sed(1) bugfixing with Martijn van Duren, and about other small userland stuff

Contributed by rueda on 2018-07-27 from the rabbit-holes-and-caves dept.

For the g2k18 Ljubljana hackathon, i decided to try and get rid of as many small userland tasks as possible. Lots of them have been piling up over time.

Hacking sed(1) with martijn@

The cooperation with Martijn@ van Duren was particularly fruitful, even though almost all of it ended up being on sed(1). Initially, we suspected multiple bugs in that utility, so the hackathon began with comparing the POSIX specification, the OpenBSD manual page, the OpenBSD implementation, and other implementations. In the end, it turned out only two of the observed oddities were outright bugs, another one dubious behaviour that we decided to improve, and all the rest merely required a better description in the manual page.

First bug: When using the opening square bracket ('[') as the regular expression (RE) delimiter in a context address or in the substitute ('s') command, as in "\[RE1[s[RE2[replacement[", an escaped opening square bracket ("\[") contained in the RE2 failed to be interpreted as the beginning of a bracket expression and was instead treated as a literal character matching itself. The fact that abusing '[' as a RE delimiter is surely a bad idea does not excuse bugs in the implementation, so we fixed the bug. Martijn@ raised the original question, but staring at the code and the documentation together, we isolated the bug from other nearby quirks and designed a much smaller and simpler fix than he originally intended: sed(1) is not quite the program you want to break by excessive changes to its innards.
After this bugfix, i also had to fix a related bogus regression test, and i added more tests checking unusual RE delimiters.
Second bug: In the same situation (RE delimited by '[' containing a bracket expression), if the bracket expression contains an opening square bracket, that bracket required escaping with a backslash, even though POSIX clearly says that inside bracket expressions, escaping is not possible and both '[' and '\' represent themselves. I did not manage to get the required OK for that bugfix yet, probably because the minimal test cases, "s[\[[][R[g" and "s[\[\[][R[g", admittedly look scary and make developers avert their eyes in disgust.
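
The POSIX rule quoted above can be checked against regcomp(3) directly. The following is only a minimal sketch: it exercises the libc regular expression engine, not the delimiter handling in sed(1) where the bug actually lived, and the helper name re_matches is made up for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the basic RE `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern
 * fails to compile.
 */
static int
re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With the C string "^[\\[]$", i.e. the RE ^[\[]$, the bracket expression contains the two members '\' and '[', so a conforming engine should match both a lone backslash and a lone opening bracket, and nothing else.
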
A suspected third bug, "s/c/\1/" erroring out even though POSIX says that if the corresponding back-reference expression does not match, then the characters "\n" shall be replaced by the empty string, turned out not to be a bug: a non-existent subexpression is not the same as a subexpression that does not match. The sentence in POSIX is intended for cases like "s/(a)|c/\1/g", which does result in pasting an empty string.
Much of the confusion resulted from sloppy descriptions in the sed(1) manual page, which mixed the content of the paragraphs about addresses, regular expressions, and substitution commands in weird ways. Martijn@ and i stared at the manual together and disentangled the three descriptions, also adding several missing details.
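
The distinction between a non-existent subexpression and one that merely did not match is visible in the regcomp(3) interface. The sketch below is an illustration under assumptions, not the code in sed(1): the enum and the function name classify_backref are invented here, and extended REs are used so that "(a)|c" can be written as in the text.

```c
#include <regex.h>
#include <stddef.h>

enum backref_state {
	BR_ERROR,	/* bad arguments, bad pattern, or no match at all */
	BR_MISSING,	/* the pattern has no subexpression number n */
	BR_UNSET,	/* subexpression n exists but did not participate */
	BR_SET		/* subexpression n matched part of the text */
};

/* Classify back-reference \n (1 <= n <= 9) for pattern applied to text. */
static enum backref_state
classify_backref(const char *pattern, const char *text, size_t n)
{
	regex_t re;
	regmatch_t pm[10];
	enum backref_state state;

	if (n < 1 || n > 9)
		return BR_ERROR;
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return BR_ERROR;
	if (n > re.re_nsub)
		state = BR_MISSING;	/* what "s/c/\1/" runs into */
	else if (regexec(&re, text, 10, pm, 0) != 0)
		state = BR_ERROR;
	else if (pm[n].rm_so == -1)
		state = BR_UNSET;	/* replaced by the empty string */
	else
		state = BR_SET;
	regfree(&re);
	return state;
}
```

For the pattern "c", asking for subexpression 1 yields BR_MISSING because re_nsub is zero, whereas "(a)|c" applied to the text "c" yields BR_UNSET: the subexpression exists, but the other branch of the alternation matched.
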
Similarly, the meaning of a backslash in the replacement string used to be incompletely described. After investigating together, martijn@ committed a complete description that isn't even significantly longer.
Also, martijn@ had found that the sed(1) list ('l') command selected a default width that was exactly one column too wide for the terminal. He wrote a patch that reduced the width in the same way already used in ed(1), i checked it, and he committed it.
This change also required me to adjust the regression suite.
Finally, martijn@ had written a patch to improve const-correctness in the libc/regex engine code, which lacked "const" on many string arguments and variables never changed by the respective functions and in one place did an ugly cast from (char *) to (const char *). Such casts are potentially dangerous because it is easy to overlook in auditing if they not only add "const", but also inadvertently change the type. I checked his patch, recommended one minor tweak, and he committed it.
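
To illustrate the point about casts, here is a small sketch that has nothing to do with the actual libc/regex sources; both function names are hypothetical. It shows that adding "const" never requires a cast, which is why an explicit one deserves suspicion.

```c
#include <stddef.h>

/*
 * The parameter is declared const because the function only reads
 * the string.  Callers holding a plain char * need no cast: the
 * compiler adds the qualifier implicitly and still checks the type.
 */
static size_t
count_byte(const char *s, char c)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (*s == c)
			n++;
	return n;
}

static size_t
count_brackets(void)
{
	char buf[] = "a[b[c";	/* writable buffer of type char[] */

	/*
	 * No cast needed.  Writing count_byte((const char *)p, '[')
	 * instead would also compile if p were, say, an int * or a
	 * wchar_t *, hiding exactly the kind of type change that an
	 * audit wants to catch.
	 */
	return count_byte(buf, '[');
}
```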

UTF-8 in small utilities

I also intended to continue my work on improving UTF-8 support in small userland utilities, but time was barely sufficient to get a single one done: lam(1). It first required a bug fix before i could even start because the -p/-P option got broken by a refactoring commit fourteen years ago (sic!). After that, i tightened the pledge(2), wrote a small regression suite, and finally sent out a patch doing proper columnation with wcwidth(3).
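
For readers unfamiliar with the technique, columnation with wcwidth(3) boils down to counting display columns instead of bytes. The following is a minimal sketch, not the lam(1) patch itself: the function name display_width is made up, and counting invalid bytes and non-printable characters as one column each is merely an assumed policy. The caller is expected to have called setlocale(LC_CTYPE, "") beforehand if UTF-8 input is to be recognized.

```c
#define _XOPEN_SOURCE 700	/* some systems hide wcwidth(3) otherwise */
#include <stdlib.h>
#include <wchar.h>

/*
 * Return the number of terminal columns needed to display the
 * multibyte string s in the current LC_CTYPE locale.
 */
static int
display_width(const char *s)
{
	wchar_t wc;
	int len, width, total = 0;

	(void)mbtowc(NULL, NULL, 0);	/* reset the conversion state */
	while (*s != '\0') {
		len = mbtowc(&wc, s, MB_CUR_MAX);
		if (len == -1) {
			/* Invalid byte: skip it, assume one column. */
			(void)mbtowc(NULL, NULL, 0);
			len = 1;
			width = 1;
		} else if ((width = wcwidth(wc)) == -1) {
			/* Non-printable character: assume one column. */
			width = 1;
		}
		s += len;
		total += width;
	}
	return total;
}
```

Padding a field to a given number of columns then means printing the string followed by the requested width minus display_width(s) spaces, rather than relying on the byte-counting field widths of printf(3).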

The next utility i considered turning to for UTF-8 support was pr(1), but i was quickly stricken by dread: On the one hand, it is no longer the most useful utility nowadays; not sure why, but fixed character width daisy-wheel and dot-matrix line printers seem to have fallen somewhat out of users' favour lately. Yet it is still in POSIX 2008! On the other hand, the options have a typical kitchen-sink design, making the thought of touching it somewhat unpleasant…

Given that it poses similar tasks as lam(1), only in a more entangled and less clean codebase, and of possibly even lower overall usefulness, and given that fixing lam(1) already cost more than a day, i quickly decided to let it be for now.

Leaving the relatively few large open UTF-8 tasks (vi, mg, regex, ksh, …) aside, it is not a coincidence that the remaining small tasks reside in arcane corners and are surprisingly tedious at this point: we proceeded from the more important to the less important and from the easier to the more complicated. Yet it is annoying that more than a dozen small programs still remain that have minor UTF-8 issues. I'm not sure how to get that fixed without wasting undue effort. I guess i shall at least make a public list of known issues at some point.

Odds and ends

So in the end, i turned to cleaning up leftovers from another change done some time ago: the removal of networks(5) support. Specifically, i deleted support for the following archaic notation of named networks from the route(8) program: now, it no longer interprets "0.192.168.4" in hosts(5) as "192.168.4/24", eliciting the disgusted exclamation "Steh nicht rum, committe das!" from Henning@ Brauer.

Right afterwards, i discussed further directions with the "routing table" (i.e. the table where claudio@, benno@, florian@, phessler@, henning@, and bluhm@ did their hacking), but the hackathon was over before i managed to get the next patch tested.

Of course, for me, work on documentation never stops. During the hackathon, i imported the new ASN1_INTEGER_get(3) manual page, finally committed my improvements to the bioctl(8) manual, helped Theo Bühler (tb@) to get in changes to the tls_init(3), tls_connect(3), and EC_POINT_add(3) manuals, provided an OK to Gonzalo@ Rodriguez to explain httpd.conf(5) in the README file of the devel/cvsweb port, and some minor bits.

Stuff that got stuck

Unfortunately, several patches that i looked at during the hackathon are still blocked and cannot go in yet:

Martijn@ wrote a patch for ed(1) to adjust the behaviour of its substitute ('s') command to what sed(1) does; however, it cannot go in yet because it still lacks the latest sed(1) bugfix. Also, the code in ed(1) gratuitously differs from the clearer code in sed(1), and the patch doesn't harmonize that yet.
Martijn@ wrote a patch to adjust realpath(3) to POSIX insofar as it ought to error out on non-existent files even if all the required directories exist. I spent some time looking into that matter, but the patch wasn't committed yet because making sure it doesn't cause regressions either in base or in ports requires tedious work.
I had hoped to finally get some UTF-8 support in for tr(1): martijn@ had been sending drafts of partial patches for that for about two years, so we spent some time discussing the matter and marveling at the wonders of its POSIX specification. Yet the deeper we sent our rabbits down that hole, the more worms they encountered. Finally, we were forced to conclude that even drafting a complete plan of which aspects of POSIX can and which cannot be implemented in OpenBSD, and which basic algorithms are candidates for the features that can be implemented, would require more work than could possibly fit into the rest of the hackathon, let alone getting started with the implementation. So we agreed to defer the whole matter. Other developers had already run away from the topic in dismay, and they cannot be blamed: there are too many pitfalls to even try to list them here.
Job@ Snijders sent a patch for aligning to the right margin in column(1). I audited that patch but it cannot go in yet because it re-parses the command line arguments over and over again for each and every table cell.
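
The realpath(3) behaviour that the patch mentioned above aims for can be probed with a few lines of C. This is only a sketch with a made-up helper name, resolve_errno, and it assumes the POSIX 2008 form of realpath(3) that allocates the result when the second argument is NULL.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdlib.h>

/*
 * Return 0 if path resolves to an existing file, or the errno value
 * reported by realpath(3) otherwise.
 */
static int
resolve_errno(const char *path)
{
	char *resolved;

	resolved = realpath(path, NULL);
	if (resolved == NULL)
		return errno;
	free(resolved);
	return 0;
}
```

On an implementation that follows POSIX in this respect, a path whose directories all exist but whose last component does not, for example a non-existent name directly below "/", should make the helper return ENOENT rather than 0.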

To summarize, the approach of trying to fix lots of small things in userland during a hackathon was only partially successful. Even seemingly small tasks tend to consume considerable time until you understand the exact specification of the utility, its codebase, the precise nature of the bug and how to fix it, and how the particular regression suite works. The more than a dozen hackathons that i spent on mandoc(1) were considerably more efficient, even those mostly spent on bugfixing, because i more or less know the specification by heart and also know the code base and regression suite quite well. Then again, lone wolf hacking like on mandoc(1) doesn't really require a hackathon environment.

In the end, i still have no idea how to get all the small userland stuff taken care of. The g2k18 experience is that a hackathon is simply too short to get much done in this area, but on a day-to-day basis, half of it gets buried and forgotten.

… and the non-geeky parts

Still, seeing all the other developers again is always a lot of fun, and of course meeting the new developers: for example, i accepted Remi@ Locherer's invitation to the Graffiti Sightseeing Tour with Špela, which was very interesting and provided many surprising insights. In general, Ljubljana is a highly likeable and lively city that i always enjoy coming back to. Then, i took naddy@'s and martijn@'s wish to go buy some Iced Earth concert tickets as a pretext for abducting them into corners of the city we would never have seen otherwise, including walking down the north face of Rožnik using a footpath so narrow that it is marked on no map, and on Friday afternoon, when all the others had already left, i took the opportunity to re-visit my favourite Ljubljana hiking destination, Golovec Hill, which i discovered back in 2014 with Rapha@el Graf.

Many thanks to Mitja Muženič for flawlessly organizing yet another hackathon in Slovenia, including an excursion to the breathtaking Škocjanske jame!


Comments

By Will Backman (bitgeist) on 2018-07-28 15:11

Thank you! All the little stuff adds up.

