Boost mailing page: Re: [boost] [General] Always treat std::strings as UTF-8


Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Peter Dimov (pdimov_at_[hidden])
Date: 2011-01-16 12:24:02

Next message: Steven Watanabe: "Re: [boost] [interhreads] Could not build the documentation"
Previous message: Frédéric Bron: "Re: [boost] Which libraries on the review schedule are ready for review?"
In reply to: Mathias Gaunard: "Re: [boost] [General] Always treat std::strings as UTF-8"
Next in thread: Alexander Lamaison: "Re: [boost] [General] Always treat std::strings as UTF-8? (was [Process] List of small issues)"

Mathias Gaunard wrote:

> POSIX system calls expect the text they receive as char* to be encoded in
> the current character locale.

No. On most Unix OSes, POSIX system calls are encoding-agnostic: they receive
a null-terminated byte sequence (NTBS) and do not interpret it. Mac OS X is
the exception; there, file paths must be UTF-8. Locales are not considered.
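
To make the "byte sequence, not text" point concrete, here is a minimal sketch that creates a file whose name is not valid UTF-8 and checks that the directory listing returns the same bytes. The helper name name_round_trips is hypothetical, and the sketch assumes an encoding-agnostic file system (typical on Linux); on Mac OS X, per the above, the open() call would be expected to fail instead.

```cpp
#include <cassert>
#include <string>

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: creates an empty file called `name` inside `dir`,
// then scans the directory for an entry with byte-identical name.
// Returns true if the bytes come back unchanged.
bool name_round_trips(const std::string& dir, const std::string& name) {
    // A file name is any byte sequence without '/' and without NUL.
    assert(name.find('/') == std::string::npos);

    const std::string path = dir + "/" + name;
    const int fd = ::open(path.c_str(), O_CREAT | O_WRONLY, 0600);
    if (fd < 0) {
        return false;  // e.g. the file system rejected the byte sequence
    }
    ::close(fd);

    bool found = false;
    if (DIR* d = ::opendir(dir.c_str())) {
        while (const dirent* entry = ::readdir(d)) {
            // Plain byte comparison; no locale or encoding is involved.
            if (name == entry->d_name) {
                found = true;
                break;
            }
        }
        ::closedir(d);
    }

    ::unlink(path.c_str());  // clean up the test file
    return found;
}
```

Calling name_round_trips("/tmp", "nonutf8_\xFF\xFE.tmp") exercises the case in question: the bytes 0xFF and 0xFE never occur in well-formed UTF-8, and no setlocale() call is made anywhere.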

> To write cross-platform code, you need to convert your UTF-8 input to the
> locale encoding when calling system calls, and convert text you receive
> from those system calls from the locale encoding to UTF-8.

This is one possible way to do it (blindly using UTF-8 is another). Strictly
speaking, under an encoding-agnostic file system, you must not convert
anything to anything because this may cause you to irretrievably lose the
original path. For display purposes, of course, you have to pick an encoding
somehow. There is no "current" character locale on Unix, by the way, unless
you count the environment variables. The OS itself doesn't care.
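
One way to honor "do not convert, but still display" is to keep the original bytes for every system call and derive a separate string for display only. The sketch below assumes UTF-8 is the encoding picked for display and substitutes U+FFFD for each byte that is not part of a well-formed sequence; the function names are hypothetical, and the raw string is never modified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// bytes there are not well-formed (bad lead byte, truncation, overlong
// form, surrogate, or a value above U+10FFFF).
static std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long cp = 0;
    if (b0 < 0x80) return 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return 0;

    if (i + len > s.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

// Display-only view of a raw file name. The result is always valid UTF-8,
// but it is lossy: it must not be passed back to open(), stat(), etc.
std::string display_name(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size();) {
        const std::size_t len = utf8_sequence_length(raw, i);
        if (len == 0) {
            out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
            ++i;
        } else {
            assert(i + len <= raw.size());
            out.append(raw, i, len);
            i += len;
        }
    }
    return out;
}
```

The design point is that display_name() is a one-way function: two different raw names can map to the same display string, so the raw bytes remain the only key used to reach the file.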

Using the current C locale (LANG=...) allows you to display the file names
the same way the 'ls' command does, whereas using UTF-8 allows your user to
enter file names which are not representable in the LANG locale.

> Windows is exactly the same, except it's got two sets of locales and two
> sets of system calls.

Nope. It doesn't have two sets of locales.

> So your technique for writing independent code is relying on the user to
> use an UTF-8 locale?

More or less. The code itself doesn't depend on the user locale and always
works, but to see the actual names in a terminal, you need a UTF-8 locale.
This is now the recommended setup on all Unix OSes.
