Contributed by tbert from the decoded-symbols dept.
Theo de Raadt (deraadt@) penned a missive titled "On the matter of strlcpy/strlcat acceptance by industry":
From time to time, there are people who say that strlcpy and strlcat are stupid. This is a little frustrating because we just want developers to have an easier time writing/auditing string code to avoid overflows and truncations, especially considering so many standard C APIs require fixed length strings or have other limits, and will in the foreseeable future. You probably all know about the mainstream users of these functions, like the Linux kernel, or MacOS, or the other BSDs, and Solaris. But there are many, many more, and it is time to show the global strlcpy'ing deniers the reality. I've collected some statistics to see how much upstream software uses these functions.
The (elided) rest of the message is below the fold; the full lists of software can be found at the link to the mailing list archive.
I asked Stuart Henderson to collect a "recursive nm .o" for every piece of software built in our ports tree. It's roughly 2GB of text output.

For those who don't know, that ports tree is basically a repository of all the application software we supply as an add-on on top of the base operating system. Each of those becomes a package, so that is what we are looking at. They are pretty much the bulk of the commonly-used Unix applications found on all systems. These packages do not generally include things like openssh, perl, X11, sqlite, or a number of other small things directly integrated into the OpenBSD base. But that's OK, because those I just mentioned do use strlcpy and strlcat in their upstream repositories.

So 3535 packages contain .o files, and now we can grep to see what they define or use. In essence, a piece of software will likely fall into one of these categories:

(0) Not use the functions at all.
(1) Will assume that the system has the functions in libc.
(2) Will have a configure-style "feature-test" which tests if libc contains the functions, and thus turn on a cpp symbol such as HAS_STRLCPY, then use the libc version. Otherwise it will avoid using them...
(3) More commonly, if the feature-test fails, it will substitute copies from its own tree. Essentially to cope with glibc.
(4) Some software contains its own version, typically copied from us, but renamed. There are many of these.

Let's look at these cases backwards, for reasons that become obvious as we move ahead.

(4) Who is defining their own versions of the functions, with slightly different names?
The obvious names we find are:

SDL_strlcpy SDL_utf8strlcpy _iodbcdm_strlcpy _strlcpy ascii_safe_strlcpy av_strlcpy cli_strlcpy dt_utf8_strlcpy fc_strlcpy fl_strlcpy flac__strlcpy fz_strlcpy g_strlcpy hd_strlcpy isc_string_strlcpy lg_strlcpy llvm_strlcpy loud_strlcpy mcs_strlcpy mg_strlcpy monoeg_g_strlcpy mowgli_strlcpy my_strlcpy mystrlcpy os_strlcpy pa_strlcpy rb_strlcpy sg_strlcpy sl_strlcpy sm_strlcpy test_evutil_strlcpy test_strlcpy tr_strlcpy ut_strlcpy utf8_strlcpy uv_strlcpy vi_strlcpy xstrlcpy zbx_strlcpy

SDL_strlcat SDL_strlcpy _iodbcdm_strlcat av_strlcat fc_strlcat fl_strlcat flac__strlcat fz_strlcat g_strlcat hd_strlcat isc_string_strlcat ixp_strlcat mcs_strlcat mowgli_strlcat mystrlcat rb_strlcat sg_strlcat sl_strlcat sm_strlcat ssh_strlcat uv_strlcat vi_strlcat wmii_strlcat xstrlcat zbx_strlcat

Replacement copies seem to be quite popular. Some of the names hint at who is doing this, but we can search by these functions to see which packages are defining them:

bogofilter bro clamav cntlm cups-filters darktable dkim-milter ffmpeg flac fltk freeciv fte glib2 gtk-gnutella htmldoc iodbc ircd-ratbox isc-bind isc-dhcp ksh93 leafnode libixp libstatgrab link-grammar linkchecker llvm mathomatic mcs mono mowgli mupdf mysql node pmacct postgresql pulseaudio rlwrap samhain sdl2 tcpreplay transmission visitors wmii wpa_supplicant xfe xpilot zabbix

So 73 (2% of 3535) packages define either of these for themselves under a new name. This may seem like a small list, but look: it contains monsters like glib2, postgresql, and mysql. In particular, those monsters contain libraries... this will become more obvious a bit further on.

(3) What about software which substitutes its own copies when it doesn't find ours? This is harder to determine in the OpenBSD ports tree because our libc functions will always be found. However, we can see if any ports sloppily compile their own versions, even though we have them...
    databases/pgpool: T strlcpy
    devel/p5-File-RsyncP: T strlcpy
    devel/py-setproctitle: T strlcpy
    editors/fte: T strlcpy
    games/oolite: T strlcpy
    games/stone-soup: T strlcpy
    games/xpilot: T strlcpy
    mail/akpop3d: T strlcpy
    net/bro: T strlcpy
    net/tcpreplay: T strlcpy
    shells/ksh93: T strlcpy
    www/cntlm: T strlcpy
    www/linkchecker: T strlcpy
    x11/xfe: T strlcpy

    editors/fte: T strlcat
    games/xpilot: T strlcat
    net/bro: T strlcat
    net/pmacct: T strlcat
    net/tcpreplay: T strlcat
    shells/ksh93: T strlcat
    www/cntlm: T strlcat
    www/linkchecker: T strlcat
    x11/xfe: T strlcat

This was rather unexpected. These software teams have decided to simply use the same name, for (hopefully) the same functionality.

(2) What about code which uses a feature test to find if the functions exist and, having not found them, avoids them? We cannot test this using the "symbol table" method. A test would need to be run on a system without the functions in libc. That test cannot be run on a BSD, MacOS, or Solaris...

(1) The question of which ports use the functions in libc should really be split into two questions. How many use our functions (strlcpy and strlcat)? How many use the renamed functions (for instance, g_strlcpy from glib, isc_string_strlcpy, etc)?

The following 254 (7% of 3535) packages use our strlcpy: [list of software elided]

The following 158 (4% of 3535) packages use our strlcat: [list of software elided]

The following 326 (9% of 3535) packages use another library's private *strlcpy function: [list of software elided]

The following 35 (1% of 3535) packages use another library's private *strlcat function:

bitlbee chromium darktable dkim-milter eboard ffmpeg flac freeciv gcompris gecko-mediaplayer gmtk gnome-mplayer gtk-gnutella gtkpod htmldoc inkscape iodbc ircd-ratbox jnettop libstatgrab mcs mplayer mupdf ncmpc osmo pidgin qemu rlwrap samhain scmpc ufraw uim wmii xmms2 zabbix

(0) Finally, we should answer the question about who is not using these functions or variants.
Let us keep the answer really simple. The following 1808 (51% of 3535) packages use strcpy: [list of software elided]

I'm not going to bother including the data for strcat. So 50% of software still calls strcpy. There is no way they have all been audited to avoid overflow.

Following this, a few more observations are in order:

(1) Remarkably, four pieces of software still use gets(3): chipmunk Wnn alpine metamail

(2) sprintf is still pretty popular. 1810 (51% of 3535) packages use it. [list of software elided] Quite worrying. The odds of overflow or truncation are very high.

(3) The above sprintf numbers are quite worrying. On the bright side, snprintf utilization is probably better than a few years ago. 1810 (38% of 3535) of packages use it. [list of software elided]

Finally, I would like to take this opportunity to remind everyone of this piece from the strlcpy(3) manual page found at http://www.openbsd.org/cgi-bin/man.cgi?query=strlcpy

[...]
RETURN VALUES
Besides quibbles over the return type (size_t versus int) and signal handler safety (snprintf(3) is not entirely safe on some systems), the following two are equivalent:

    n = strlcpy(dst, src, len);
    n = snprintf(dst, len, "%s", src);

Like snprintf(3), the strlcpy() and strlcat() functions return the total length of the string they tried to create. For strlcpy() that means the length of src. For strlcat() that means the initial length of dst plus the length of src.
[...]

snprintf, strlcpy, and strlcat are used in exactly the same way. Using .o file symbols like above does not prove to us whether people are using the APIs in the most careful way -- that would require a source code inspection. But to provide an example, bind9 contains 114 uses of snprintf which don't check the return value to spot truncation, with code like the following:

    char buf[DNS_NAME_FORMATSIZE + sizeof(": TSIG ''")];
    [...]
    char namebuf[DNS_NAME_FORMATSIZE];
    dns_name_format(&zone->tsigkey->name, namebuf, sizeof(namebuf));
    snprintf(buf, sizeof(buf), ": TSIG '%s'", namebuf);

Fine, maybe it is safe, of the "it has been audited, and next time someone is here, they will audit it again" kind. I also don't have time to verify this or the 113 other cases, nor is it my job.

I bring this up to ask why strlcpy/strlcat are being held to some arbitrary standard that they should handle truncation better, when it is the case that they handle it JUST LIKE the commonplace snprintf API. Right here in mainstream code, we see that snprintf's return is not being handled, against best practice taught everywhere.

Should snprintf call abort? That's ridiculous. Should it crash? What should it do? The fact that no other function of that sort has ever made it into the mainstream perhaps shows the arguments are weak. If something is better, take some real software and fix it.

To upstream authors of software who are using the functions: please continue incorporating more of them into your software, because it is good for the users of your software. Please check the return values to spot truncation as described in the manual page, and properly handle that condition in the best way you can based on the location of the call. Thanks!
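To make the manual page's advice concrete, here is a minimal sketch of the truncation check for both calls. The names demo_strlcpy, copy_checked, and format_checked are hypothetical, invented for this illustration. demo_strlcpy is a small stand-in in the spirit of the renamed copies listed under (4), included only so the example builds on systems whose libc lacks strlcpy; it is not the OpenBSD implementation.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical renamed stand-in for strlcpy(3); the name is an
 * assumption for this example. Copies at most dstsize - 1 bytes,
 * always NUL-terminates when dstsize > 0, and returns strlen(src).
 */
static size_t
demo_strlcpy(char *dst, const char *src, size_t dstsize)
{
	size_t srclen = strlen(src);

	if (dstsize != 0) {
		size_t n = srclen < dstsize - 1 ? srclen : dstsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return srclen;	/* length of the string we tried to create */
}

/* Copy src into dst; return 0 on success, -1 if src was truncated. */
static int
copy_checked(char *dst, size_t dstsize, const char *src)
{
	if (demo_strlcpy(dst, src, dstsize) >= dstsize)
		return -1;
	return 0;
}

/*
 * The same check written with snprintf(3): a negative return is an
 * error, and a return >= dstsize means the output was truncated.
 */
static int
format_checked(char *dst, size_t dstsize, const char *src)
{
	int n = snprintf(dst, dstsize, "%s", src);

	if (n < 0 || (size_t)n >= dstsize)
		return -1;
	return 0;
}
```

A caller would test the result, for example `if (copy_checked(buf, sizeof(buf), name) == -1)`, and then decide what truncation means at that call site: report an error, fall back to a larger buffer, or accept the shortened string deliberately.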