Ingo Schwarze (schwarze@) writes in to explain this change and how it improves security.
If two programs communicating encoded character strings to each other
disagree about the encoding, that can result in problems.
One particular example of such communication is an application program
passing output text to a terminal emulator program. If the terminal
uses a different encoding for decoding the text than the application
used for encoding it, the terminal may see control codes where the
application only intended printable characters. This can screw up the
terminal state, spoiling display of subsequent text or even hanging
the terminal.
Actually, i assume that this problem occurs frequently in practice,
for the following reasons. If the application program is well-behaved,
it either produces C/POSIX/US-ASCII output only, or its idea of the
encoding to use is governed by the LC_CTYPE locale(1) environment
variable, typically passed to it by the shell it was started from.
Now that locale(1) environment is completely unrelated to whatever
encoding the terminal may be set up for. It may not even be on the
same physical machine. For example, during an SSH session, your
terminal is on the local SSH client machine, while the shell starting
your application programs is on the remote SSH server machine.
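
As a rough illustration of which variable decides the application side's encoding, here is a minimal shell sketch. The helper names effective_ctype and ctype_is_utf8 are hypothetical, and the code only approximates the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); real programs ask the C library instead of reading the environment themselves. Note that the result describes only what the application will assume; it says nothing about the terminal.

```shell
# Hypothetical helper: print the locale name that governs LC_CTYPE,
# following the precedence LC_ALL > LC_CTYPE > LANG > "C".
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Hypothetical helper: succeed if that name looks like a UTF-8 locale.
# The pattern match on the name is an assumption, not a libc query.
ctype_is_utf8() {
    case $(effective_ctype) in
        *.UTF-8*|*.utf8*|*.utf-8*|*.UTF8*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Running ctype_is_utf8 on the remote side of an SSH session tells you what remote programs are likely to send, which can then be compared by hand with the mode the local terminal is actually in.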
To fully appreciate the implications, try out the following scenario:
Start an xterm(1) that is not UTF-8 enabled on your local machine
by saying xterm +lc +u8. Unset LC_ALL, LC_CTYPE, and LANG; check
with locale(1) that your locale is "C". Use ssh(1) to connect to
a remote machine. Now simulate a program producing UTF-8 output
on the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:
$ printf "\303\237\n" # thanks to sobrado@ for the striking example
Now your local terminal hangs until you force a reset using the
menus of the xterm program, because the '\237' byte appearing in
the UTF-8 encoding of that LATIN SMALL LETTER SHARP S also is the
ISO 6429 C1 control code "application program command" - it doesn't
do anything useful in xterm(1), but causes subsequent bytes to be
ignored until you send the "string terminator" byte '\234', which
you probably won't ever do. There are literally hundreds of different
control sequences that terminals may or may not respect, some more
or less universal, some highly specific for certain types of terminals,
changing fonts, colors, encodings, window titles, moving windows
around and resizing them, some even changing keyboard bindings, and
many, many more things - some of which may actually be dangerous
depending on what exactly you are using your terminal for. If the
shell startup files on the remote machine set LC_CTYPE=en_US.UTF-8
or something similar by default, programs on the remote machine
will always do just that: send UTF-8 encoded output over the wire
that can utterly confuse your local terminal.
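
To look at the offending byte without sending it raw to the terminal, the bytes can be dumped with od(1), which prints plain ASCII only. The sketch below also defines a hypothetical helper, has_c1_bytes, that reports whether its input contains any byte in the range 0x80 to 0x9f (octal 200 to 237), the range an 8-bit terminal may interpret as C1 controls. It is an illustration only, not a complete safety check.

```shell
# Dump the UTF-8 encoding of U+00DF in octal; the values are 303 237 012.
printf '\303\237\n' | od -An -to1

# Hypothetical helper: succeed if stdin contains any byte in 0x80-0x9f.
# od prints one octal value per byte; 200-237 octal is 0x80-0x9f.
has_c1_bytes() {
    od -An -v -to1 | tr -s ' ' '\n' | grep -Eq '^2[0-3][0-7]$'
}

# Example: U+00DF trips the check, plain ASCII does not.
if printf '\303\237\n' | has_c1_bytes; then
    echo "contains bytes an 8-bit terminal may treat as C1 controls"
fi
```

Not every non-ASCII character is affected: U+00E4, encoded as \303\244, contains no byte in that range, which is part of why the problem can go unnoticed for a long time.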
That shows how easy it is to inadvertently cause application-terminal
character encoding mismatches; yet i doubt that many people are aware
of the problem. So we should try to reduce the likelihood that people
get burnt by such effects.
On an operating system supporting any third locale in addition to
C/POSIX and UTF-8, people are screwed beyond rescue because even
if one side of the connection assumes US-ASCII, communication is
still unsafe in both directions. Reinterpreting US-ASCII in an
arbitrary encoding and reinterpreting an arbitrary encoding as
US-ASCII may both turn innocuous printable characters into dangerous
terminal control codes. That is particularly bitter because some
programs will always output US-ASCII, which is not safe to display
in a terminal set up for an arbitrary locale.
Fortunately, in OpenBSD, we made the decision to only support exactly
two locales, C/POSIX and UTF-8, and this combination has the following
properties:
Printing unsanitized strings to the terminal is never safe,
no matter the locale and terminal setup (think of cat /bsd).
Printing sanitized US-ASCII to a US-ASCII terminal is safe.
Printing sanitized UTF-8 to a UTF-8 terminal is safe.
Printing sanitized US-ASCII to a UTF-8 terminal is safe.
That is important because there are some programs that we may
never want to add UTF-8 support to.
Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe.
Remember the example above that hung a US-ASCII terminal by
printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.
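
What "sanitized US-ASCII" could mean in practice is sketched below. This is an assumption for illustration, not a description of what any OpenBSD utility actually does: the hypothetical filter sanitize_ascii keeps newlines and printable ASCII (space through tilde) and replaces every other byte, including ESC and all bytes with the high bit set, with a question mark.

```shell
# Hypothetical filter: keep newline and printable US-ASCII (octal 040-176),
# replace every other byte with '?'. LC_ALL=C makes tr(1) work on bytes.
sanitize_ascii() {
    LC_ALL=C tr -c '\012\040-\176' '?'
}

# Example: the UTF-8 sharp s and an escape sequence are both defused.
printf 'a\303\237b\033[2Jc\n' | sanitize_ascii
```

The output of such a filter contains no control codes other than newline, so by the properties listed above it is safe to display on either a US-ASCII or a UTF-8 terminal, at the price of losing all non-ASCII text.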
Until this week, our xterm(1) ran in US-ASCII mode by default. In
view of the above, that was a terrible idea, even if the user didn't
intend to ever use UTF-8. A UTF-8 terminal handles the US-ASCII
the user wants just fine, and in addition to that, and mostly for
free, it is more resilient against stray UTF-8 sneaking in.
Actually, even when fed garbage or unsupported encodings, a UTF-8
xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8
xterm(1) honours *fewer* terminal escape codes than the US-ASCII
xterm(1). That may seem surprising at first because Unicode defines
*more* control characters than US-ASCII does. But as explained on
xterm(1) never treats decoded multibyte characters as terminal
control codes, so the ISO 6429 C1 control codes do not take effect
in UTF-8 mode; but they do take effect in US-ASCII mode, even though
they fall outside the scope of ASCII.
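
For reference, each 8-bit C1 control has a 7-bit equivalent consisting of ESC followed by a byte in the range 0x40 to 0x5f; that is how ISO 6429 (ECMA-48) defines the two representations. The following sketch uses a hypothetical helper, c1_to_esc, to print the two-byte ESC form for a given C1 byte value, so the mapping can be inspected with od(1) rather than sent raw to a terminal.

```shell
# Hypothetical helper: print the 7-bit ESC form of one 8-bit C1 control.
# Argument: the byte value, decimal or 0x-prefixed, in the range 0x80-0x9f.
c1_to_esc() {
    byte=$(($1))
    [ "$byte" -ge 128 ] && [ "$byte" -le 159 ] || return 1
    # C1 control 0x80+n corresponds to ESC followed by the byte 0x40+n.
    printf '\033'
    printf "\\$(printf '%03o' $((byte - 64)))"
}

# Example: inspect the ESC form of 0x9f ("application program command").
c1_to_esc 0x9f | od -An -to1
```

The single byte '\237' from the earlier example therefore corresponds to ESC followed by an underscore, and the "string terminator" '\234' to ESC followed by a backslash.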
Consequently, in the interest of safe and sane defaults, i recently
switched our xterm(1) to enable UTF-8 mode by default. I did that
by adding this resource to /usr/X11R6/share/X11/app-defaults/XTerm:
The main goal is improving robustness. But it also improves
usability. If you usually run your shells inside xterm(1) in C/POSIX
mode, there should be few visible changes for you. But if you
ever stumble upon a directory containing UTF-8 filenames, you can run
$ LC_CTYPE=en_US.UTF-8 ls
which would have given you garbage output in the past, and which
just works now in OpenBSD-current.
If you really insist on running xterm(1) in traditional 8-bit
character mode by default like in the past - which, nota bene, isn't
quite C/POSIX/US-ASCII but does many additional things you are
probably unaware of - you can do so in any of the following ways.
But i do not recommend that at all; there are hardly any sane use
cases - maybe except using weird, probably unsafe software that
insists on sending ISO 6429 C1 controls in 8-bit mode rather than
encoding them as two-byte sequences with the ASCII ESCAPE character
as most software implementing terminal control via terminal control
codes does. If you insist against all advice, you can:
Add XTerm*locale: true to your ~/.Xresources file,
or use the -lc command line option for the same effect.
That will also use UTF-8 mode, but use luit(1) to transform
US-ASCII to UTF-8 on input which is probably mostly a NOOP,
but might expose some subtle differences. Not recommended.
Add XTerm*locale: false to your ~/.Xresources file,
or use the +lc command line option for the same effect.
That will inspect LC_CTYPE in the environment and use UTF-8 mode
if that specifies a UTF-8 locale, and traditional 8-bit
character mode otherwise.
Don't forget to run xrdb ~/.Xresources after editing the file.
Add XTerm*locale: medium to your ~/.Xresources file,
to get exactly the old defaults back. They do weird things;
read the source code in charproc.c, function VTInitialize_locale(),
lines 7385-7404 for details. Not recommended.
Setting XTerm*locale to any other specific locale or using
the -en command line option is accepted, too, but doesn't
make much sense because OpenBSD does not support any other
locales in the first place.
If you encounter any problems, do not hesitate to tell me.
Thanks to Igor Sobrado (sobrado@) for bringing the problem to my attention
and to Christian "naddy@" Weisgerber for suggesting to do the switch
by changing the file /usr/X11R6/share/X11/app-defaults/XTerm.
One final word of caution. Do not use this non-standard default
setting on any other system except OpenBSD. It only works because
OpenBSD deliberately does not support any locales except UTF-8
and C/POSIX/US-ASCII. Terrible things will happen if you force
the default to UTF-8 in this way on a system where people can
actually opt into multibyte locales that differ from UTF-8.
On other systems, there is no way in hell to make the interaction
of locales with terminal controls truly safe.
Specifically, when talking about SSH connections, the only case
where you can stop worrying about locales is when connecting from
OpenBSD to OpenBSD and only when the client side is running in an
xterm(1) with this patch. Any connection involving any other
operating system is unsafe, even if you don't plan to intentionally
transfer non-ASCII characters, even if you know that any one of the
two sides is set to C/POSIX/US-ASCII mode, unless you *manually*
make sure that the character encoding locales on both sides of the
connection agree. I doubt that many people have developed a habit
of checking that manually for each and every SSH connection they do
before starting programs like ls(1) on the remote side. I doubt
that many people even know how to check which mode their local xterm(1)
is running in - hint: looking at LC_CTYPE has nothing to do with it;
you have to hold the Ctrl key, click the right mouse button inside
the xterm(1) window, and look at the "fonts" menu, even though this
question isn't about fonts at all. If there is a tick mark in front
of the "UTF-8 Encoding" menu entry, you are in UTF-8 mode, otherwise,
you are in traditional 8-bit character mode.
I think it's good that at least OpenBSD-current to OpenBSD SSH
connections are safe in this respect from now on, that people
can safely switch LC_CTYPE back and forth to their heart's content
in any xterm(1) using the default configuration, and
that people no longer need to worry about locales in this respect
at least. But connecting to and from other operating systems still