OpenBSD Journal
xterm(1) now UTF-8 by default
Contributed by tj on Tue Mar 8 20:41:19 2016 (GMT)
from the xterminating-bugs dept.

For safety and usability, xterm(1) now uses UTF-8 mode by default.

CVSROOT:	/cvs
Module name:	xenocara
Changes by:	schwarze@cvs.openbsd.org	2016/03/08 10:26:30

Modified files:
	app/xterm      : XTerm.ad 

Log message:
Use UTF-8 mode by default because it is safer and more useful
even for people always running with a C/POSIX locale(1).
OK matthieu@ naddy@ martijn@

Ingo Schwarze (schwarze@) writes in to explain this change and how it improves security.

If two programs communicating encoded character strings to each other disagree about the encoding, that can result in problems.

One particular example of such communication is an application program passing output text to a terminal emulator program. If the terminal uses a different encoding for decoding the text than the application used for encoding it, the terminal may see control codes where the application only intended printable characters. This can screw up the terminal state, spoiling display of subsequent text or even hanging the terminal.

Actually, i assume that this problem occurs frequently in practice, for the following reasons. If the application program is well-behaved, it either produces C/POSIX/US-ASCII output only, or its idea of the encoding to use is governed by the LC_CTYPE locale(1) environment variable, typically passed to it by the shell it was started from. That locale(1) environment is completely unrelated to whatever encoding the terminal may be set up for. The application and the terminal may not even be on the same physical machine. For example, during an SSH session, your terminal is on the local SSH client machine, while the shell starting your application programs is on the remote SSH server machine.

To fully appreciate the implications, try out the following scenario:

  • Start an xterm(1) that is not UTF-8 enabled on your local machine by saying xterm +lc +u8.
  • Unset LC_ALL, LC_CTYPE, and LANG; check with locale(1) that your locale is "C".
  • Use ssh(1) to connect to a remote machine.
  • Now simulate a program producing UTF-8 output on the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:

$ printf "\303\237\n"  # thanks to sobrado@ for the striking example

Now your local terminal hangs until you force a reset using the menus of the xterm program, because the '\237' byte appearing in the UTF-8 encoding of that LATIN SMALL LETTER SHARP S is also the ISO 6429 C1 control code "application program command". It doesn't do anything useful in xterm(1), but causes subsequent bytes to be ignored until you send the "string terminator" byte '\234', which you probably won't ever do.

There are literally hundreds of different control sequences that terminals may or may not respect, some more or less universal, some highly specific to certain types of terminals: changing fonts, colors, encodings, and window titles, moving windows around and resizing them, some even changing keyboard bindings, and many, many more things, some of which may actually be dangerous depending on what exactly you are using your terminal for.

If the shell startup files on the remote machine set LC_CTYPE=en_US.UTF-8 or something similar by default, programs on the remote machine will always do just that: send UTF-8 encoded output over the wire that can utterly confuse your local terminal.
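The two bytes from the printf example can be inspected without sending them raw to the terminal by piping them through od(1). This is only a sketch, assuming a POSIX shell with od(1) and tr(1); the helper name hexbytes is made up for the illustration.

```shell
# hexbytes (hypothetical helper name): print stdin as one run of
# lowercase hex digits, so that no raw byte ever reaches the terminal.
hexbytes() {
    od -An -v -tx1 | tr -d ' \t\n'
}

# U+00DF in UTF-8, followed by a newline.
printf '\303\237\n' | hexbytes; echo    # prints c39f0a
```

The second byte, 0x9f (octal \237), falls into the range 0x80 to 0x9f that ISO 6429 reserves for C1 controls, which is why a terminal in 8-bit mode takes it for "application program command" rather than for half of a character.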

That shows how easy it is to inadvertently cause application-terminal character encoding mismatches; yet i doubt that many people are aware of the problem. So we should try to reduce the likelihood that people get burnt by such effects.

On an operating system supporting any third locale in addition to C/POSIX and UTF-8, people are screwed beyond rescue because even if one side of the connection assumes US-ASCII, communication is still unsafe in both directions. Reinterpreting US-ASCII in an arbitrary encoding and reinterpreting an arbitrary encoding as US-ASCII may both turn innocuous printable characters into dangerous terminal control codes. That is particularly bitter because some programs will always output US-ASCII, which is not safe to display in a terminal set up for an arbitrary locale.

Fortunately, in OpenBSD, we made the decision to only support exactly two locales, C/POSIX and UTF-8, and this combination has the following properties:

  • Printing unsanitized strings to the terminal is never safe, no matter the locale and terminal setup (think of cat /bsd).
  • Printing sanitized US-ASCII to a US-ASCII terminal is safe.
  • Printing sanitized UTF-8 to a UTF-8 terminal is safe.
  • Printing sanitized US-ASCII to a UTF-8 terminal is safe. That is important because there are some programs that we may never want to add UTF-8 support to.

    However:

  • Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe. Remember the example above that hung a US-ASCII terminal by printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.
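The list does not spell out what "sanitized" means. One crude reading, which is an assumption of this sketch and not a description of what any OpenBSD program actually does, is to replace every byte that is not printable US-ASCII or a newline with a question mark. The function name ascii_sanitize is invented for the illustration.

```shell
# ascii_sanitize (hypothetical name): keep newline (\012) and the
# printable US-ASCII range space..tilde (\040-\176); turn every other
# byte, including ESC and all bytes >= 0x80, into '?'.
ascii_sanitize() {
    LC_ALL=C tr -c '\012\040-\176' '[?*]'
}

# A UTF-8 sharp s followed by an escape sequence that would change the colour.
printf '\303\237 \033[31mred\n' | ascii_sanitize    # prints ?? ?[31mred
```

The result is plain US-ASCII, so it falls under the cases the list calls safe for both kinds of terminal, at the price of losing every non-ASCII character.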

    Until this week, our xterm(1) ran in US-ASCII mode by default. In view of the above, that was a terrible idea, even if the user didn't intend to ever use UTF-8. A UTF-8 terminal handles the US-ASCII the user wants just fine, and in addition to that, and mostly for free, it is more resilient against stray UTF-8 sneaking in.

    Actually, even when fed garbage or unsupported encodings, a UTF-8 xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8 xterm(1) honours *fewer* terminal escape codes than the US-ASCII xterm(1). That may seem surprising at first because Unicode defines *more* control characters than US-ASCII does. But as explained on

    http://invisible-island.net/xterm/ctlseqs/ctlseqs.html

    xterm(1) never treats decoded multibyte characters as terminal control codes, so the ISO 6429 C1 control codes do not take effect in UTF-8 mode; but they do take effect in US-ASCII mode, even though they fall outside the scope of ASCII.
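Since a terminal in traditional 8-bit mode treats the bytes 0x80 to 0x9f as C1 controls, one rough way to judge whether some output is risky for such a terminal is to count the bytes in that range. A minimal sketch, assuming POSIX tr(1) and wc(1); count_c1_bytes is an invented name, and a count of zero says nothing about ESC-based sequences.

```shell
# count_c1_bytes (hypothetical name): delete every byte outside the
# C1 range \200-\237 (0x80-0x9f) and count what is left.
count_c1_bytes() {
    LC_ALL=C tr -cd '\200-\237' | wc -c | tr -d ' '
}

printf '\303\237\n' | count_c1_bytes    # sharp s, bytes c3 9f: prints 1
printf '\303\251\n' | count_c1_bytes    # e acute, bytes c3 a9: prints 0
```

Not every non-ASCII character is affected, which is part of why such mismatches can go unnoticed for a long time.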

    Consequently, in the interest of safe and sane defaults, i recently switched our xterm(1) to enable UTF-8 mode by default. I did that by adding this resource to /usr/X11R6/share/X11/app-defaults/XTerm:

    *locale: UTF-8

    The main goal is improving robustness. But it also improves usability. If you usually run your shells inside xterm(1) in C/POSIX mode, there should be few visible changes for you. But if you ever stumble upon a directory containing UTF-8 filenames, you can simply say

    $ LC_CTYPE=en_US.UTF-8 ls

    which would have given you garbage output in the past, and which just works now in OpenBSD-current.

    If you really insist on running xterm(1) in traditional 8-bit character mode by default like in the past (which, nota bene, isn't quite C/POSIX/US-ASCII but does many additional things you are probably unaware of), you can do so in any of the following ways. But i do not recommend that at all. There are hardly any sane use cases, except maybe using weird, probably unsafe software that insists on sending ISO 6429 C1 controls in 8-bit mode rather than encoding them as two-byte sequences starting with the ASCII ESCAPE character, as most software implementing terminal control via terminal control codes does. If you insist against all advice, you can:

  • Add XTerm*locale: true to your ~/.Xresources file, or use the -lc command line option for the same effect. That will also use UTF-8 mode, but use luit(1) to transform US-ASCII to UTF-8 on input, which is probably mostly a no-op but might expose some subtle differences. Not recommended.
  • Add XTerm*locale: false to your ~/.Xresources file, or use the +lc command line option for the same effect. That will inspect LC_CTYPE in the environment and use UTF-8 mode if that specifies a UTF-8 locale, and traditional 8-bit character mode otherwise. Don't forget to run xrdb ~/.Xresources after editing the file.
  • Add XTerm*locale: medium to your ~/.Xresources file to get exactly the old defaults back. They do weird things; read the source code in charproc.c, function VTInitialize_locale(), lines 7385-7404, for details. Not recommended.
  • Setting XTerm*locale to any other specific locale or using the -en command line option is accepted, too, but doesn't make much sense because OpenBSD does not support any other locales in the first place.
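The two-byte form mentioned before the list follows a fixed rule in ISO 6429: an 8-bit C1 control with value n corresponds to ESC followed by the byte n - 0x40. A small sketch, assuming a POSIX shell; the function name and its octal argument convention are made up for the illustration.

```shell
# c1_final_byte (hypothetical name): take the octal value of an 8-bit
# C1 control (200 to 237) and print the byte that follows ESC in the
# equivalent 7-bit sequence, i.e. the value minus octal 100 (0x40).
c1_final_byte() {
    printf "\\$(printf '%o' "$(( 0$1 - 0100 ))")"
    echo
}

c1_final_byte 237    # APC, the byte from the sharp s example: prints _
c1_final_byte 234    # ST, the string terminator: prints \
```

So the 7-bit spelling of "application program command" is ESC _ and that of "string terminator" is ESC \, both of which consist of US-ASCII bytes only.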

    If you encounter any problems, do not hesitate to tell me.

    Thanks to Igor Sobrado (sobrado@) for bringing the problem to my attention and to Christian "naddy@" Weisgerber for suggesting that the switch be done by changing the file /usr/X11R6/share/X11/app-defaults/XTerm.

    One final word of caution. Do not use this non-standard default setting on any system other than OpenBSD. It only works because OpenBSD deliberately does not support any locales except UTF-8 and C/POSIX/US-ASCII. Terrible things will happen if you force the default to UTF-8 in this way on a system where people can actually opt into multibyte locales that differ from UTF-8.

    On other systems, there is no way in hell to make the interaction of locales with terminal controls truly safe.

    Specifically, when talking about SSH connections, the only case where you can stop worrying about locales is when connecting from OpenBSD to OpenBSD, and only when the client side is running in an xterm(1) with this patch. Any connection involving any other operating system is unsafe, even if you don't plan to intentionally transfer non-ASCII characters, and even if you know that one of the two sides is set to C/POSIX/US-ASCII mode, unless you *manually* make sure that the character encodings on both sides of the connection agree. I doubt that many people have developed a habit of checking that manually for each and every SSH connection they make before starting programs like ls(1) on the remote side.

    I doubt that many people even know how to check which mode their local xterm(1) is running in. Hint: looking at LC_CTYPE has nothing to do with it. You have to hold the Ctrl key, click the right mouse button inside the xterm(1) window, and look at the "fonts" menu, even though this question isn't about fonts at all. If there is a tick mark in front of the "UTF-8 Encoding" menu entry, you are in UTF-8 mode; otherwise, you are in traditional 8-bit character mode.

    I think it's good that at least OpenBSD-current to OpenBSD SSH connections are safe in this respect from now on, that people can safely switch LC_CTYPE back and forth to their heart's content in any xterm(1) using the default configuration, and that they no longer need to worry about locales in this respect. But connecting to and from other operating systems still needs caution...

    Thanks for the detailed explanation, Ingo!



      Re: xterm(1) now UTF-8 by default (mod 9/53)
    by Just Another OpenBSD User (77.85.135.0) on Wed Mar 9 08:57:59 2016 (GMT)
      OpenBSD is very comfortable on the desktop. Thanks even more for the actual work on bringing UTF-8 closer and safer to the masses. And the docs that save bits in our brains, is incredibly useful and quick to grasp. Mental maintenance is minimal, it just works.

      Re: xterm(1) now UTF-8 by default (mod -3/43)
    by grey (grey) (grey@artkiver.com) on Thu Mar 10 02:49:10 2016 (GMT)
    After a hiatus beginning in 2006, I have returned in 2016. BofH of the undead.
      I wonder if I am doing something incorrectly, this is on a VM running a snapshot from March 9th.

    On the left is an xterm, the right is Terminal.app from OS X:

    http://i.imgur.com/EAQPM2E.png

    The sanskrit is just a test script with an old vedic prayer in Devanagari

    I spend a lot more time working with Japanese, but for the sake of testing UTF-8, still comes in handy.



    Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. Some icons from slashdot.org used with permission from Kathleen. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. Search engine is ht://Dig. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]