Contributed by maxime on from the charset-for-IRC-junkies dept.
On July 27th, Stefan Sperling (stsp@) added support for multi-byte characters to the OpenBSD libc. Thanks to the work of the people involved in its development, the OpenBSD C library now supports the Unicode character encoding scheme UTF-8. Read on for the full commit message and some words from Stefan about what needs to be tested and how to test it:
From: Stefan Sperling
To: email@example.com
Subject: CVS: cvs.openbsd.org: src
Date: Tue, 27 Jul 2010 10:59:04 -0600 (MDT)

CVSROOT: /cvs
Module name: src
Changes by: firstname.lastname@example.org 2010/07/27 10:59:04

Modified files:
    distrib/special/libstubs: Makefile
    lib/libc : Makefile.inc
    lib/libc/citrus: citrus_ctype.h citrus_ctype_local.h
    lib/libc/locale: Makefile.inc runetable.c setrunelocale.c
    share/locale/ctype: Makefile
Added files:
    distrib/special/libstubs: mbrtowc_sb.c
    lib/libc/citrus: Makefile.inc citrus_ctype.c citrus_none.c citrus_none.h citrus_utf8.c citrus_utf8.h
    lib/libc/locale: btowc.c mblen.c mbrlen.c mbstowcs.c mbtowc.c multibyte.h multibyte_citrus.c wcscoll.c wcstombs.c wcsxfrm.c wctob.c wctomb.c
Removed files:
    lib/libc/locale: mbrtowc_sb.c multibyte_sb.c

Log message:
Replace the single-byte placeholders for the multi-byte/wide-character conversion interfaces of libc (mbrtowc(3) and friends) with new implementations that internally call an API based on NetBSD's citrus. This allows us to support locales with multi-byte character encodings.
Provide two implementations of the citrus-based API: one based on the old single-byte placeholders for use with our existing single-byte character locales (C, ISO8859-*, KOI8, CP1251, etc.), and one that provides support for UTF-8 encoded characters (code based on FreeBSD's implementation).
Install the en_US.UTF-8 ctype locale support file, and allow the UTF-8 ctype locale to be enabled via setlocale(3) (export LC_CTYPE='en_US.UTF-8').
A lot of programs, especially from ports, will now start using UTF-8 if the UTF-8 locale is enabled. Use at your own risk, and please report any breakage. Note that ncurses-based programs cannot display UTF-8 right now, this is being worked on.
To prevent install media growth, add vfprintf(3) and mbrtowc(3) to libstubs. The mbrtowc stub was copied unchanged from its old single-byte placeholder. vfprintf.c doesn't need to be copied, just put in .PATH (hint by fgsch@).
Testing by myself, naddy, sthen, nicm, espie, armani, Dmitrij D. Czarkoff.
ok matthieu espie millert sthen nicm deraadt
From: Christian Weisgerber
To: email@example.com
Subject: CVS: cvs.openbsd.org: ports
Date: Wed, 28 Jul 2010 14:25:11 -0600 (MDT)

CVSROOT: /cvs
Module name: ports
Changes by: firstname.lastname@example.org 2010/07/28 14:25:11

Modified files:
    shells/bash : Makefile

Log message:
Enable multibyte support. Makes regression tests happier.
People might want to read the mbrtowc(3) and vfprintf(3) man pages. As always, users are invited to test and to report any bugs. Stefan also provided undeadly with some notes on the testing that is required for this particular change:
My commit only provided foundations for UTF-8 support. It makes a lot of things work, but there are still many pieces in the system which need to be tweaked in order to make proper use of the UTF-8 support in libc.
It is unlikely that much of the higher-layer stuff will be enabled for 4.8. We're already at ABI lock. But shipping 4.8 with the fundamentals built-in means that the fundamentals can easily be tested by a lot of people to spot fallout. It makes more sense to deal with the higher layers during the 4.9 cycle, because it is a lot of work.
Obviously, to use UTF-8 a terminal capable of displaying UTF-8 is needed. Right now, and maybe forever, the only option is an X11 terminal emulator like xterm(1), or the various Gnome/KDE/XFCE terminal emulators. A suitable font that contains a lot of Unicode characters is also required. The default 'fixed' font of xterm(1) includes quite a lot of characters. I'm personally using a DejaVu font instead, which is included in xenocara. In ~/.Xdefaults:
    XTerm*Font: -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
Here is a screenshot of xterm(1) using this font to display Markus Kuhn's UTF-8 demo test file. The screenshot also shows that tmux(1) is ready for UTF-8.
It would be very hard to make the text console support UTF-8 display. That is certainly out-of-scope for me. I won't touch that.
It's definitely not a good idea to run the entire system with LC_CTYPE=en_US.UTF-8 in the environment. It's a bad idea to set LC_CTYPE=en_US.UTF-8 in ~/.profile, because it could cause gibberish to be displayed on the text console. I run my entire X session with LC_CTYPE=en_US.UTF-8, like this:
    $ cat ~/.xsession
    env LC_CTYPE="en_US.UTF-8" /usr/local/bin/startxfce4
I'd recommend doing it this way when helping with testing.
The most important thing to look out for is stuff that used to work with single-byte character sets like ISO8859-1 but does not work with UTF-8. In environments with high stability requirements, the UTF-8 locale should not be used at all. The UTF-8 locale can also be used with specific applications only, by starting the applications from uxterm(1) and using a non-UTF-8 locale for the rest of the xsession.
Hint: With mutt from ports, the -slang flavour is required for UTF-8 to work because of the current limitations in ncurses.
Also note that the only UTF-8 locale we currently have is en_US.UTF-8. There is no de_DE.UTF-8, fr_FR.UTF-8, etc. This might be inconvenient for people relying on localisation of program messages.
So, this is work in progress, with the very first step completed. Marc Espie (espie@) tells us more about what still needs to be done:
Independently of other libraries, there are also lots of wide-char functions AND locale stuff which we don't yet have, such as wprintf or strcoll support. Until we have these, a lot of software will simply not pick up any locale support during the configure steps. cursesw is the most visible "next step", but by no means is it the only one...