xterm(1) now UTF-8 by default

Contributed by tj on 2016-03-08 from the xterminating-bugs dept.

For safety and usability, xterm(1) now uses UTF-8 mode by default.

CVSROOT:	/cvs
Module name:	xenocara
Changes by:	schwarze@cvs.openbsd.org	2016/03/08 10:26:30

Modified files:
	app/xterm      : XTerm.ad 

Log message:
Use UTF-8 mode by default because it is safer and more useful
even for people always running with a C/POSIX locale(1).
OK matthieu@ naddy@ martijn@

Ingo Schwarze (schwarze@) writes in to explain this change and how it improves security.

If two programs communicating encoded character strings to each other disagree about the encoding, that can result in problems.
One particular example of such communication is an application program passing output text to a terminal emulator program. If the terminal uses a different encoding for decoding the text than the application used for encoding it, the terminal may see control codes where the application only intended printable characters. This can screw up the terminal state, spoiling display of subsequent text or even hanging the terminal.
Actually, I assume that this problem occurs frequently in practice, for the following reasons. If the application program is well-behaved, it either produces C/POSIX/US-ASCII output only, or its idea of the encoding to use is governed by the LC_CTYPE locale(1) environment variable, typically passed to it by the shell it was started from. But that locale(1) environment is completely unrelated to whatever encoding the terminal may be set up for. The terminal may not even be on the same physical machine: during an SSH session, for example, your terminal is on the local SSH client machine, while the shell starting your application programs is on the remote SSH server machine.
To fully appreciate the implications, try out the following scenario:
Start an xterm(1) that is not UTF-8 enabled on your local machine by saying xterm +lc +u8.
Unset LC_ALL, LC_CTYPE, and LANG; check with locale(1) that your locale is "C".
Use ssh(1) to connect to a remote machine.
Now simulate a program producing UTF-8 output on the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:
$ printf "\303\237\n"  # thanks to sobrado@ for the striking example
Now your local terminal hangs until you force a reset using the menus of the xterm program. The '\237' byte appearing in the UTF-8 encoding of LATIN SMALL LETTER SHARP S is also the ISO 6429 C1 control code "application program command". It doesn't do anything useful in xterm(1), but it causes subsequent bytes to be ignored until you send the "string terminator" byte '\234', which you probably won't ever do.
There are literally hundreds of different control sequences that terminals may or may not respect, some more or less universal, some highly specific to certain types of terminals. They change fonts, colors, encodings, and window titles, move windows around and resize them, some even change keyboard bindings, and they do many, many more things, some of which may actually be dangerous depending on what exactly you are using your terminal for.
If the shell startup files on the remote machine set LC_CTYPE=en_US.UTF-8 or something similar by default, programs on the remote machine will always do just that: send UTF-8 encoded output over the wire that can utterly confuse your local terminal.
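To see where the troublesome byte comes from, the encoding can be inspected directly. The following is a minimal Python sketch, not part of the original change; the helper name c1_bytes is made up for illustration. It lists the bytes of a UTF-8 encoded string that fall into the range 0x80-0x9F, which a terminal in 8-bit mode reads as ISO 6429 C1 controls.

```python
# Illustrative sketch only; c1_bytes is a hypothetical helper, not an xterm API.

C1_RANGE = range(0x80, 0xA0)  # ISO 6429 C1 control codes as single 8-bit bytes


def c1_bytes(text):
    """Return the UTF-8 bytes of text that an 8-bit terminal reads as C1 controls."""
    return [b for b in text.encode("utf-8") if b in C1_RANGE]


if __name__ == "__main__":
    sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
    # The two bytes that the printf(1) example sends, shown in octal.
    print([oct(b) for b in sharp_s.encode("utf-8")])
    # The subset of those bytes that lands in the C1 range.
    print([oct(b) for b in c1_bytes(sharp_s)])
    # Pure ASCII text never produces bytes in the C1 range.
    print(c1_bytes("plain ASCII"))
```

UTF-8 continuation bytes range over 0x80-0xBF, so U+00DF is far from the only character whose encoding contains a byte in the C1 range.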
That shows how easy it is to inadvertently cause application-terminal character encoding mismatches; yet I doubt that many people are aware of the problem. So we should try to reduce the likelihood that people get burnt by such effects.
On an operating system supporting any third locale in addition to C/POSIX and UTF-8, people are screwed beyond rescue because even if one side of the connection assumes US-ASCII, communication is still unsafe in both directions. Reinterpreting US-ASCII in an arbitrary encoding and reinterpreting an arbitrary encoding as US-ASCII may both turn innocuous printable characters into dangerous terminal control codes. That is particularly bitter because some programs will always output US-ASCII, which is not safe to display in a terminal set up for an arbitrary locale.
Fortunately, in OpenBSD, we made the decision to only support exactly two locales, C/POSIX and UTF-8, and this combination has the following properties:

Printing unsanitized strings to the terminal is never safe, no matter the locale and terminal setup (think of cat /bsd).
Printing sanitized US-ASCII to a US-ASCII terminal is safe.
Printing sanitized UTF-8 to a UTF-8 terminal is safe.
Printing sanitized US-ASCII to a UTF-8 terminal is safe. That is important because there are some programs that we may never want to add UTF-8 support to.
However:
Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe. Remember the example above that hung a US-ASCII terminal by printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.
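The four combinations above can be replayed with a deliberately simplified model. This Python sketch is an illustration under stated assumptions, not a description of how xterm(1) is implemented: the function name seen_as_control is hypothetical, and the UTF-8 branch simply assumes that decoded multibyte characters are never treated as controls.

```python
# Simplified model for illustration; the names below are hypothetical and
# this is not how xterm(1) is implemented.

def seen_as_control(data: bytes, utf8_terminal: bool) -> bool:
    """Return True if the modelled terminal would treat part of data as a control code."""
    if utf8_terminal:
        # Assumption: decoded multibyte characters are never treated as
        # controls, so only single-byte C0 controls and DEL count here.
        text = data.decode("utf-8", errors="replace")
        return any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text)
    # 8-bit mode: C0 controls, DEL, and the C1 range 0x80-0x9F all count.
    return any(b < 0x20 or b == 0x7F or 0x80 <= b <= 0x9F for b in data)


if __name__ == "__main__":
    ascii_out = "Strasse".encode("ascii")      # sanitized US-ASCII output
    utf8_out = "Stra\u00dfe".encode("utf-8")   # sanitized UTF-8 output
    for label, data in (("US-ASCII", ascii_out), ("UTF-8", utf8_out)):
        for utf8_terminal in (False, True):
            mode = "UTF-8" if utf8_terminal else "US-ASCII"
            verdict = "NOT safe" if seen_as_control(data, utf8_terminal) else "safe"
            print(f"{label} output on {mode} terminal: {verdict}")
```

In this model, only the combination of UTF-8 output with a US-ASCII terminal is flagged, which matches the list above.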
Until this week, our xterm(1) ran in US-ASCII mode by default. In view of the above, that was a terrible idea, even if the user didn't intend to ever use UTF-8. A UTF-8 terminal handles the US-ASCII the user wants just fine, and in addition to that, and mostly for free, it is more resilient against stray UTF-8 sneaking in.
Actually, even when fed garbage or unsupported encodings, a UTF-8 xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8 xterm(1) honours *fewer* terminal escape codes than the US-ASCII xterm(1). That may seem surprising at first because Unicode defines *more* control characters than US-ASCII does. But as explained on
http://invisible-island.net/xterm/ctlseqs/ctlseqs.html
xterm(1) never treats decoded multibyte characters as terminal control codes, so the ISO 6429 C1 control codes do not take effect in UTF-8 mode; but they do take effect in US-ASCII mode, even though they fall outside the scope of ASCII.
Consequently, in the interest of safe and sane defaults, I recently switched our xterm(1) to enable UTF-8 mode by default. I did that by adding this resource to /usr/X11R6/share/X11/app-defaults/XTerm:
*locale: UTF-8
The main goal is improving robustness. But it also improves usability. If you usually run your shells inside xterm(1) in C/POSIX mode, there should be few visible changes for you. But if you ever stumble upon a directory containing UTF-8 filenames, you can simply say
$ LC_CTYPE=en_US.UTF-8 ls
which would have given you garbage output in the past, and which just works now in OpenBSD-current.
If you really insist on running xterm(1) in traditional 8-bit character mode by default like in the past (which, nota bene, isn't quite C/POSIX/US-ASCII but does many additional things you are probably unaware of), you can do so in any of the following ways. But I do not recommend that at all. There are hardly any sane use cases, except maybe using weird, probably unsafe software that insists on sending ISO 6429 C1 controls in 8-bit mode rather than encoding them as two-byte sequences starting with the ASCII ESCAPE character, as most software implementing terminal control via terminal control codes does. If you insist against all advice, you can:
Add XTerm*locale: true to your ~/.Xresources file, or use the -lc command line option for the same effect. That will also use UTF-8 mode, but use luit(1) to transform US-ASCII to UTF-8 on input, which is probably mostly a no-op, but might expose some subtle differences. Not recommended.
Add XTerm*locale: false to your ~/.Xresources file, or use the +lc command line option for the same effect. That will inspect LC_CTYPE in the environment and use UTF-8 mode if that specifies a UTF-8 locale, and traditional 8-bit character mode otherwise. Don't forget to run xrdb ~/.Xresources after editing the file.
Add XTerm*locale: medium to your ~/.Xresources file, to get exactly the old defaults back. They do weird things; read the source code in charproc.c, function VTInitialize_locale(), lines 7385-7404, for details. Not recommended.
Setting XTerm*locale to any other specific locale or using the -en command line option is accepted, too, but doesn't make much sense because OpenBSD does not support any other locales in the first place.
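As an aside on the two-byte sequences mentioned before this list: in ISO 6429, each 8-bit C1 control has a 7-bit equivalent consisting of the ASCII ESCAPE character followed by the C1 byte with 0x40 subtracted. The following Python sketch shows that mapping for a few well-known controls; the function name is hypothetical and chosen only for this illustration.

```python
# Illustrative sketch; the function name is hypothetical.

ESC = 0x1B  # ASCII ESCAPE


def c1_to_escape_sequence(c1_byte: int) -> bytes:
    """Return the 7-bit two-byte form (ESC + final byte) of an 8-bit C1 control."""
    if not 0x80 <= c1_byte <= 0x9F:
        raise ValueError("not a C1 control byte: %#x" % c1_byte)
    return bytes([ESC, c1_byte - 0x40])


if __name__ == "__main__":
    # CSI = control sequence introducer, ST = string terminator,
    # APC = application program command.
    for name, byte in (("CSI", 0x9B), ("ST", 0x9C), ("APC", 0x9F)):
        print(name, oct(byte), c1_to_escape_sequence(byte))
```

The 7-bit form consists of ASCII bytes only, so software that sticks to it does not depend on the terminal running in 8-bit mode.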
If you encounter any problems, do not hesitate to tell me.
Thanks to Igor Sobrado@ for bringing the problem to my attention and to Christian "naddy@" Weisgerber for suggesting to do the switch by changing the file /usr/X11R6/share/X11/app-defaults/XTerm.
One final word of caution. Do not use this non-standard default setting on any other system except OpenBSD. It only works because OpenBSD deliberately does not support any locales except UTF-8 and C/POSIX/US-ASCII. Terrible things will happen if you force the default to UTF-8 in this way on a system where people can actually opt into multibyte locales that differ from UTF-8.
On other systems, there is no way in hell to make the interaction of locales with terminal controls truly safe.
Specifically, when talking about SSH connections, the only case where you can stop worrying about locales is when connecting from OpenBSD to OpenBSD, and only when the client side is running in an xterm(1) with this patch. Any connection involving any other operating system is unsafe, even if you don't plan to intentionally transfer non-ASCII characters, and even if you know that one of the two sides is set to C/POSIX/US-ASCII mode, unless you *manually* make sure that the character encoding locales on both sides of the connection agree. I doubt that many people have developed a habit of checking that manually for each and every SSH connection they make before starting programs like ls(1) on the remote side.
I doubt that many people even know how to check which mode their local xterm(1) is running in. Hint: looking at LC_CTYPE has nothing to do with it. You have to hold the Ctrl key, click the right mouse button inside the xterm(1) window, and look at the "fonts" menu, even though this question isn't about fonts at all. If there is a tick mark in front of the "UTF-8 Encoding" menu entry, you are in UTF-8 mode; otherwise, you are in traditional 8-bit character mode.
I think it's good that at least OpenBSD-current to OpenBSD SSH connections are safe in this respect from now on, that people can safely switch LC_CTYPE back and forth to their heart's content in any xterm(1) using the default configuration, and that they no longer need to worry about locales in this respect at least. But connecting to and from other operating systems still needs caution...

Thanks for the detailed explanation, Ingo!

Comments

By Just Another OpenBSD User (77.85.135.0) on 2016-03-09 08:57

OpenBSD is very comfortable on the desktop. Thanks even more for the actual work on bringing UTF-8 closer and safer to the masses. And the docs, which save bits in our brains, are incredibly useful and quick to grasp. Mental maintenance is minimal, it just works.

By grey (grey) grey@artkiver.com on 2016-03-10 02:49
After a hiatus beginning in 2006, I have returned in 2016. BofH of the undead.

I wonder if I am doing something incorrectly, this is on a VM running a snapshot from March 9th.

On the left is an xterm, the right is Terminal.app from OS X:

http://i.imgur.com/EAQPM2E.png

The Sanskrit is just a test script with an old Vedic prayer in Devanagari.

I spend a lot more time working with Japanese, but for the sake of testing UTF-8, it still comes in handy.
Comments
1. By rjc (rjc) on 2016-03-10 10:41
  
  > I wonder if I am doing something incorrectly, this is on a VM running a snapshot from March 9th.
  >
  > On the left is an xterm, the right is Terminal.app from OS X:
  >
  > http://i.imgur.com/EAQPM2E.png
  >
  > The Sanskrit is just a test script with an old Vedic prayer in Devanagari.
  >
  > I spend a lot more time working with Japanese, but for the sake of testing UTF-8, it still comes in handy.
  
  Do you have the appropriate fonts available on your system?
  Comments
  1. By grey (grey) on 2016-03-10 18:30
    
    > Do you have the appropriate fonts available on your system?
    
    I am uncertain, I did not see much mention of this in the article, nor in the FAQ http://www.openbsd.org/faq/faq11.html
    
    If you have any recommendations on which font to use and how to go about configuring it to address this, they would be welcome.
    
    Thank you for the suggestion, I'll continue reading X11 documentation as well.
    
    Comments
    
    By Just Another OpenBSD User (77.85.135.0) on 2016-03-10 18:46
    
    > If you have any recommendations on which font to use
    
    http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html
    
    On OpenBSD it works out of the box, you need not configure anything to get it. Just make sure you use UTF-8 in xterm(1) and set your LOCALE appropriately if not already running the latest snapshot.
    
    > Thank you for the suggestion, I'll continue reading X11 documentation as well.
